RAG series: ARAGOG

By Matouš Eibich
April 2, 2024

Introduction

During our development of Retrieval-Augmented Generation (RAG) systems for multiple clients, we recognized a significant gap in current research: while interest in RAG is growing and literature reviews abound, there is a noticeable lack of comprehensive experimental comparisons across the spectrum of advanced RAG methods. Our study "ARAGOG: Advanced RAG Output Grading" aims to fill this void by evaluating various RAG techniques. The work not only sharpens our expertise for future client RAG projects but also contributes valuable insights to the open-source community.

Experiment Design

Our research tested a variety of advanced RAG techniques to explore their impact on enhancing Large Language Models (LLMs). The techniques evaluated include Sentence-window Retrieval, Document Summary Index, Hypothetical Document Embedding (HyDE), Multi-query, Maximal Marginal Relevance (MMR), Cohere Rerank, and LLM Rerank. Each of these methods was chosen for its potential to improve the precision and contextuality of information retrieval, a critical aspect of LLM performance. 
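
To make at least one of these techniques concrete, the sketch below implements Maximal Marginal Relevance over pre-computed embeddings. It is a minimal conceptual example under our own naming (mmr, query_emb, doc_embs), not necessarily the implementation used in the study.

import numpy as np

def mmr(query_emb, doc_embs, k=5, lambda_param=0.5):
    """Select k documents balancing query relevance against redundancy.

    query_emb: (d,) embedding of the query
    doc_embs:  (n, d) embeddings of candidate documents
    lambda_param: 1.0 = pure relevance, 0.0 = pure diversity
    """
    def cos(a, b):
        # Cosine similarity between every row of a and every row of b.
        return a @ b.T / (np.linalg.norm(a, axis=-1, keepdims=True) * np.linalg.norm(b, axis=-1))

    query_sim = cos(query_emb[None, :], doc_embs)[0]   # (n,) similarity to the query
    doc_sim = cos(doc_embs, doc_embs)                  # (n, n) pairwise document similarity

    selected, candidates = [], list(range(len(doc_embs)))
    while candidates and len(selected) < k:
        if not selected:
            # First pick: the most query-relevant document.
            best = candidates[int(np.argmax(query_sim[candidates]))]
        else:
            # MMR score: relevance minus similarity to anything already chosen.
            scores = [
                lambda_param * query_sim[i]
                - (1 - lambda_param) * max(doc_sim[i][j] for j in selected)
                for i in candidates
            ]
            best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected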

To assess the efficacy of these RAG techniques, we employed two primary metrics: Retrieval Precision and Answer Similarity. Retrieval Precision measures the relevance of the information retrieved by the system in response to a query, while Answer Similarity evaluates how closely the system's generated answers align with reference responses. For our experiments, we used a dataset drawn from the AI ArXiv collection, incorporating a variety of technical questions and more general inquiries to rigorously test the selected RAG systems.
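
As a rough illustration of what these two metrics capture, the snippet below approximates both with embedding cosine similarity. The function names and the relevance threshold are our own assumptions for the sketch; the study itself used its own evaluation tooling rather than this exact scoring.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieval_precision(retrieved_chunk_embs, query_emb, threshold=0.7):
    # Fraction of retrieved chunks whose similarity to the query clears a
    # relevance threshold; the 0.7 cutoff is an illustrative assumption.
    relevant = [c for c in retrieved_chunk_embs if cosine(c, query_emb) >= threshold]
    return len(relevant) / len(retrieved_chunk_embs)

def answer_similarity(generated_answer_emb, reference_answer_emb):
    # How closely the generated answer aligns with the reference answer,
    # here reduced to a single cosine-similarity score.
    return cosine(generated_answer_emb, reference_answer_emb)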

Dataset preparation for the experiment

Findings

Our investigation into various RAG techniques revealed nuanced performance across the methods studied. The Sentence Window Retrieval technique stood out for its high retrieval precision, demonstrating its effectiveness in accurately sourcing relevant information. However, its performance on answer similarity varied, suggesting that while it excels at retrieval, the translation of that information into coherent answers could be improved. On the other hand, techniques like Hypothetical Document Embedding (HyDE) and LLM Rerank significantly enhanced retrieval precision without requiring re-indexing of the vector database, positioning them as valuable tools for improving the accuracy of LLM outputs. Notably, established methods such as Maximal Marginal Relevance (MMR) and Cohere Rerank did not show a marked advantage over the baseline Naive RAG system, indicating that their impact may be more context-dependent.
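
The HyDE result is easier to interpret with the mechanism in view: rather than embedding the raw question, the system first asks an LLM to draft a hypothetical answer and retrieves against that draft's embedding, so the existing document index is reused unchanged. The sketch below shows that flow; llm, embed, and vector_store are placeholders for whichever model, embedding function, and vector database are in use, not names from the study.

def hyde_retrieve(question, llm, embed, vector_store, top_k=5):
    # Hypothetical Document Embedding (HyDE) retrieval, schematically.
    # 1. Ask the LLM to write a plausible (possibly imperfect) answer.
    hypothetical_doc = llm(
        "Write a short passage that answers the question:\n" + question
    )
    # 2. Embed the hypothetical passage instead of the raw question.
    query_embedding = embed(hypothetical_doc)
    # 3. Search the existing vector index with that embedding;
    #    no re-indexing of the stored documents is required.
    return vector_store.search(query_embedding, top_k=top_k)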

Boxplot of Retrieval Precision by Experiment. Each boxplot demonstrates the range and distribution of retrieval precision scores across different RAG techniques. Higher median values and tighter interquartile ranges suggest better performance and consistency.

Conclusion

Our study, while comprehensive, is shaped by certain limitations, including the use of a singular dataset, a constrained set of questions, and evaluation with GPT-3.5-turbo, which may not showcase the full capabilities of more advanced models. Recognizing these constraints, we view our research as a foundational step in experimental RAG studies, rather than the final word. We've made our experimental pipeline openly available on GitHub, encouraging the scientific community to build on, refine, and critique our work. We invite further exploration and validation of our findings, aiming to collectively advance our understanding and application of RAG technologies in the field.
