RAG systems combine the power of retrieval mechanisms and language models, enabling them to generate contextually relevant and well-grounded responses. However, evaluating the performance of a RAG system and identifying its potential failure modes can be very hard.
Enter the RAG Triad – a set of three metrics that evaluate the three main steps of a RAG system's execution: Context Relevance, Groundedness, and Answer Relevance. In this blog post, I'll go through the intricacies of the RAG Triad and guide you through the process of setting up, executing, and analyzing the evaluation of a RAG system.
Introduction to the RAG Triad:
At the heart of every RAG system lies a delicate balance between retrieval and generation. The RAG Triad provides a comprehensive framework to evaluate the quality and potential failure modes of this delicate balance. Let's break down the three components.
A. Context Relevance:
Imagine being expected to answer a question when the information you've been handed is completely unrelated. That's precisely what a RAG system aims to avoid. Context Relevance assesses the quality of the retrieval process by evaluating how relevant each piece of retrieved context is to the original query. By scoring the relevance of the retrieved context, we can identify potential issues in the retrieval mechanism and make the necessary adjustments.
B. Groundedness:
Have you ever had a conversation where someone seemed to be making up facts or providing information with no solid foundation? That's the equivalent of a RAG system lacking groundedness. Groundedness evaluates whether the final response generated by the system is well-grounded in the retrieved context. If the response contains statements or claims that are not supported by the retrieved information, the system may be hallucinating or relying too heavily on its pre-training data, leading to potential inaccuracies or biases.
C. Answer Relevance:
Imagine asking for directions to the nearest coffee shop and receiving a detailed recipe for baking a cake. That's the kind of situation Answer Relevance aims to prevent. This component of the RAG Triad evaluates whether the final response generated by the system is truly relevant to the original query. By assessing the relevance of the answer, we can identify instances where the system may have misunderstood the question or strayed from the intended topic.
Setting up the RAG Triad Evaluation
Before we can dive into the evaluation process, we need to lay the groundwork. Let's walk through the necessary steps to set up the RAG Triad evaluation.
A. Importing Libraries and Establishing API Keys:
First things first, we need to import the required libraries and modules and set up our OpenAI API key, since OpenAI will serve as the LLM provider for both the application and the evaluations.
import warnings
warnings.filterwarnings('ignore')
import utils
import os
import openai
openai.api_key = utils.get_openai_api_key()
from trulens_eval import Tru
tru = Tru()  # workspace object we'll use later to fetch records and the leaderboard
B. Loading and Indexing the Document Corpus:
Next, we'll load and index the document corpus that our RAG system will be working with. In our case, we'll be using a PDF document on "How to Build a Career in AI" by Andrew Ng.
from llama_index import SimpleDirectoryReader
documents = SimpleDirectoryReader(
input_files=["./eBook-How-to-Build-a-Career-in-AI.pdf"]
).load_data()
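The sentence-window index we build later in this post expects a single document rather than the list of per-page documents returned by the reader, so it helps to merge them first. A minimal sketch, assuming the legacy llama_index Document API:
from llama_index import Document
# Merge the per-page documents into one Document for indexing
document = Document(text="\n\n".join([doc.text for doc in documents]))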
C. Defining the Feedback Functions:
At the core of the RAG Triad evaluation are the feedback functions – specialized functions that assess each component of the triad. Let's define these functions using the TruLens library.
from llama_index.llms import OpenAI
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

# Feedback provider: the LLM that TruLens uses to score each step of the triad
from trulens_eval import OpenAI as fOpenAI
provider = fOpenAI()

# Answer Relevance
from trulens_eval import Feedback
f_qa_relevance = Feedback(
    provider.relevance_with_cot_reasons,
    name="Answer Relevance"
).on_input_output()

# Context Relevance
import numpy as np
from trulens_eval import TruLlama

# Select the text of each retrieved source node as the "context" to score
context_selection = TruLlama.select_source_nodes().node.text

f_qs_relevance = (
    Feedback(provider.qs_relevance_with_cot_reasons,
             name="Context Relevance")
    .on_input()
    .on(context_selection)
    .aggregate(np.mean)
)
# Groundedness
from trulens_eval.feedback import Groundedness
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons,
             name="Groundedness")
    .on(context_selection)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)
Executing the RAG Application and Evaluation
With the setup complete, it's time to put our RAG system and the evaluation framework into action. Let's walk through the steps involved in executing the application and recording the evaluation results.
A. Preparing the Evaluation Questions:
First, we'll load a set of evaluation questions that we want our RAG system to answer. These questions will serve as the basis for our evaluation process.
eval_questions = []
with open('eval_questions.txt', 'r') as file:
    for line in file:
        item = line.strip()
        eval_questions.append(item)
B. Running the RAG Application and Recording Results:
Next, we'll set up the TruLens recorder, which will help us record the prompts, responses, and evaluation results in a local database.
from trulens_eval import TruLlama
# sentence_window_engine is the sentence-window query engine we build in the
# advanced RAG section below
tru_recorder = TruLlama(
    sentence_window_engine,
    app_id="App_1",
    feedbacks=[
        f_qa_relevance,
        f_qs_relevance,
        f_groundedness
    ]
)

for question in eval_questions:
    with tru_recorder as recording:
        sentence_window_engine.query(question)
As the RAG application runs on each evaluation question, the TruLens recorder will diligently capture the prompts, responses, intermediate results, and evaluation scores, storing them in a local database for further analysis.
Analyzing the Evaluation Results
With the evaluation data at our fingertips, it's time to dig into the analysis and extract insights. Let's look at the various ways we can analyze the results and identify potential areas for improvement.
A. Examining Individual Record-Level Results:
Sometimes, the devil is in the details. By examining individual record-level results, we can gain a deeper understanding of the strengths and weaknesses of our RAG system.
records, feedback = tru.get_records_and_feedback(app_ids=[])
records.head()
This code snippet gives us access to the prompts, responses, and evaluation scores for each individual record, allowing us to identify specific instances where the system may have struggled or excelled.
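Because the feedback scores appear as DataFrame columns named after the Feedback functions we defined, we can sort by a given metric to surface the weakest responses first. A small sketch, assuming pandas is available and the column names match the Feedback names above:
import pandas as pd
pd.set_option("display.max_colwidth", None)  # show full prompts and responses

# Surface the records with the lowest Groundedness scores first
records[["input", "output"] + feedback].sort_values("Groundedness").head()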
B. Viewing Aggregate Performance Metrics:
Let's take a step back and look at the bigger picture. The TruLens library provides us with a leaderboard that aggregates the performance metrics across all records, giving us a high-level view of our RAG system's overall performance.
tru.get_leaderboard(app_ids=[])
This leaderboard displays the average scores for each component of the RAG Triad, along with metrics such as latency and cost. By analyzing these aggregate metrics, we can identify trends and patterns that may not be apparent at the record level.
C. Exploring the TruLens Streamlit Dashboard:
In addition to the programmatic interface, TruLens offers a Streamlit dashboard that provides a GUI for exploring and analyzing the evaluation results. With a single command, we can launch the dashboard.
tru.run_dashboard()
Once the dashboard is up and running, we see a comprehensive overview of our RAG system's performance. At a glance, we can see the aggregate metrics for each component of the RAG Triad, as well as latency and cost information.
By selecting our application from the dropdown menu, we can access a detailed record-level view of the evaluation results. Each record is neatly displayed, complete with the user's input prompt, the RAG system's response, and the corresponding scores for Answer Relevance, Context Relevance, and Groundedness.
Clicking on an individual record reveals more insights. We can explore the chain-of-thought reasoning behind each evaluation score, which explains the thought process of the language model performing the evaluation. This level of transparency is useful for identifying potential failure modes and areas for improvement.
Let's say we come across a record where the Groundedness score is low. By drilling into the details, we may discover that the RAG system's response contains statements that are not well-grounded in the retrieved context. The dashboard shows us exactly which statements lack supporting evidence, allowing us to pinpoint the root cause of the issue.
The TruLens Streamlit dashboard is more than just a visualization tool. By using its interactive capabilities and data-driven insights, we can make informed decisions and take targeted actions to enhance the performance of our applications.
Advanced RAG Techniques and Iterative Improvement
A. Introducing the Sentence Window RAG Technique:
One advanced technique is Sentence Window RAG, which addresses a common failure mode of RAG systems: retrieved chunks that are too narrow to give the language model enough surrounding context. Rather than simply enlarging the chunks, the system retrieves individual sentences and then expands each retrieved sentence with a window of neighboring sentences before synthesis, providing the language model with more relevant and comprehensive information and potentially improving the system's Context Relevance and Groundedness.
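Under the hood, this is typically implemented with llama_index's sentence-window node parser plus a metadata-replacement postprocessor, which is roughly what the build_sentence_window_index helper used below does internally. A minimal sketch based on the legacy llama_index API:
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

# Each node stores a single sentence, plus a "window" of surrounding sentences
# in its metadata
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# At query time, the retrieved sentence is swapped for its full window before
# the text is handed to the LLM for synthesis
postproc = MetadataReplacementPostProcessor(target_metadata_key="window")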
B. Re-evaluating with the RAG Triad:
After implementing the Sentence Window RAG technique, we can put it to the test by re-evaluating it using the same RAG Triad framework. This time, we'll focus our attention on the Context Relevance and Groundedness scores, looking for improvements in these areas as a result of the increased context size.
# Set up the Sentence Window RAG
# These helper functions come from the accompanying utils module; adjust the
# import to wherever they are defined in your setup
from utils import build_sentence_window_index, get_sentence_window_query_engine

sentence_index = build_sentence_window_index(
    document,
    llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    save_dir="sentence_index"
)
sentence_window_engine = get_sentence_window_query_engine(sentence_index)
# Re-evaluate with the RAG Triad
for question in eval_questions:
    with tru_recorder as recording:
        sentence_window_engine.query(question)
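One practical refinement: recording each configuration under its own app_id keeps the runs separate on the TruLens leaderboard, so the baseline and the sentence-window variant can be compared side by side. A minimal sketch (tru_recorder_sw and the "App_2" label are arbitrary names of my own):
tru_recorder_sw = TruLlama(
    sentence_window_engine,
    app_id="App_2",
    feedbacks=[f_qa_relevance, f_qs_relevance, f_groundedness]
)

for question in eval_questions:
    with tru_recorder_sw as recording:
        sentence_window_engine.query(question)

# The leaderboard now shows one row per app_id
tru.get_leaderboard(app_ids=[])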
C. Experimenting with Different Window Sizes:
While the Sentence Window RAG technique can potentially improve performance, the optimal window size may vary depending on the specific use case and dataset. Too small a window size may not provide enough relevant context, while too large a window size could introduce irrelevant information, impacting the system's Groundedness and Answer Relevance.
By experimenting with different window sizes and re-evaluating using the RAG Triad, we can find the sweet spot that balances context relevance with groundedness and answer relevance, ultimately leading to a more robust and reliable RAG system.
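Concretely, this experiment can be run by rebuilding the index for a few window sizes and recording each run under its own app_id. A rough sketch, assuming the build_sentence_window_index helper exposes a sentence_window_size parameter (a hypothetical name; check the helper's signature in your setup):
for window_size in [1, 3, 5]:
    index = build_sentence_window_index(
        document,
        llm,
        embed_model="local:BAAI/bge-small-en-v1.5",
        sentence_window_size=window_size,  # hypothetical parameter name
        save_dir=f"sentence_index_{window_size}",
    )
    engine = get_sentence_window_query_engine(index)
    recorder = TruLlama(
        engine,
        app_id=f"sentence_window_{window_size}",
        feedbacks=[f_qa_relevance, f_qs_relevance, f_groundedness],
    )
    for question in eval_questions:
        with recorder as recording:
            engine.query(question)

# Compare the configurations on the leaderboard
tru.get_leaderboard(app_ids=[])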
Conclusion:
The RAG Triad, comprising Context Relevance, Groundedness, and Answer Relevance, has proven to be a useful framework for evaluating the performance and identifying potential failure modes of Retrieval-Augmented Generation systems.