Evaluating RAG Quality: Implementing Faithfulness and Answer Relevance Metrics

Building a Retrieval-Augmented Generation (RAG) pipeline is the easy part. Making sure that pipeline isn’t hallucinating or giving users irrelevant fluff is where most engineering teams hit a wall. I’ve spent the last few months iterating on evaluation frameworks, and I’ve learned that relying on vibes-based testing—where you manually ask a few questions and hope for the best—is a recipe for production failure.

To build a robust system, you need to quantify your performance. The two metrics that matter most are Faithfulness (is the answer derived solely from the retrieved context?) and Answer Relevance (does the answer actually address the user's prompt?).

The "LLM-as-a-Judge" Approach

I’ve moved away from traditional NLP metrics like BLEU or ROUGE. They don't capture semantic meaning in RAG. Instead, I use an "LLM-as-a-judge" pattern. You essentially use a high-performing model (like GPT-4o or Claude 3.5 Sonnet) to evaluate the output of your RAG chain.

The logic is straightforward:

Faithfulness: Extract statements from the generated answer and check if they can be inferred from the retrieved chunks.
Answer Relevance: Compare the generated answer against the user query to ensure the intent is met, ignoring the context entirely for this step.

Implementation: Evaluating with Ragas

I prefer using the ragas library for this because it handles the complex prompt engineering required to get consistent scores from judge models. Here is how I structure the evaluation script in my current project.

import os
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevance
from datasets import Dataset

# I keep my evaluation data in a standard format
# questions: the user prompt
# answers: the generated response from the RAG pipeline
# contexts: the list of retrieved chunks
data = {
    "question": ["How do I reset my password?"],
    "answer": ["You can reset it via the settings page."],
    "contexts": [["Users can reset passwords by navigating to Settings > Security."]]
}

dataset = Dataset.from_dict(data)

# I use a production-grade LLM as the judge
# Note: Ensure you have your API keys set in your environment
def run_evaluation(dataset):
    print("Starting evaluation run...")
    
    # Ragas handles the prompt orchestration under the hood
    results = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevance],
    )
    
    return results

# Running the evaluation and outputting scores
if __name__ == "__main__":
    evaluation_results = run_evaluation(dataset)
    print(f"Faithfulness Score: {evaluation_results['faithfulness']}")
    print(f"Answer Relevance Score: {evaluation_results['answer_relevance']}")

Architectural Trade-offs

When you integrate this into your CI/CD pipeline, you face a few operational hurdles:

Latency: Running an evaluation loop is slow. I don't run this on every commit. Instead, I trigger it against a "Golden Dataset" of 50-100 questions whenever I update the retrieval logic or the prompt template.
Cost: Using a model like GPT-4o to judge every request is expensive. For local development, I use a smaller model like llama-3-8b-instruct or mistral-nemo as the judge. It’s less accurate, but it’s cheap and gives me a directional signal before I push to staging.
The Context Window Trap: If you pass massive context blocks to your judge model, it might lose focus (the "lost in the middle" phenomenon). I recommend truncating your retrieved chunks to a maximum of 3-4 documents before running the evaluation.

Debugging Low Scores

If your faithfulness score is consistently low, your RAG system is likely hallucinating. My first step is always to check the retrieval quality. If the retrieved chunks don't contain the answer, the LLM will try to fill the gaps using its pre-trained weights.

If the chunks do contain the answer but the faithfulness score is still low, look at your system prompt. I often find that the prompt is too vague. I tighten it by adding: "Answer the question using only the provided context. If the answer is not present, state that you do not know. Do not use outside knowledge."

If the answer relevance is low, the issue is usually in the retrieval stage—specifically the embedding model or the chunking strategy. I’ve found that moving from simple fixed-size chunking to semantic chunking often fixes relevance issues by ensuring the retrieved context actually contains the answer's core intent.

Don't treat evaluation as a one-time task. It is a feedback loop. Every time I get a low score, I turn that specific query into a test case in my golden set. That is how you build a system that actually stays reliable as your data grows.

The "LLM-as-a-Judge" Approach

Implementation: Evaluating with Ragas

Architectural Trade-offs

Debugging Low Scores

Aditya Shenvi