
Retrieval-Augmented Generation (RAG)


An LLM approach that retrieves relevant information from a knowledge base before generating responses, grounding outputs in specific source documents.

Also known as: RAG, grounded generation

Overview

RAG combines information retrieval with LLM generation, allowing systems to reference specific documents or knowledge bases. Testing RAG systems requires evaluating both retrieval quality and how well the LLM uses retrieved context.

RAG Components

The retrieval step fetches documents or passages from a knowledge base that are relevant to the user's query, typically using semantic search or vector similarity to find content that might answer the question. The augmentation step provides the retrieved information to the LLM as additional context beyond its training data. In the generation step, the LLM produces a response based on both the original query and the retrieved context. Finally, an optional citation step adds references indicating which sources were used, enabling verification and building trust.
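
These four components can be sketched end to end in a few lines of Python. This is a minimal illustration rather than a production pipeline: the retriever here is a toy keyword-overlap scorer standing in for vector search, and `call_llm` is a hypothetical placeholder for whatever model API the system actually uses.

```python
# Minimal RAG pipeline sketch: retrieve -> augment -> generate -> cite.

def retrieve(query: str, knowledge_base: list[dict], k: int = 3) -> list[dict]:
    """Rank documents by naive keyword overlap with the query.
    A real retriever would use vector similarity over embeddings."""
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(doc["text"].lower().split())), doc)
        for doc in knowledge_base
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def augment(query: str, docs: list[dict]) -> str:
    """Build a prompt that pairs the user's query with the retrieved context."""
    context = "\n\n".join(f"[{doc['id']}] {doc['text']}" for doc in docs)
    return (
        "Answer using only the context below and cite sources by id.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call; swap in your provider's client."""
    return "(model response would appear here)"

def rag_answer(query: str, knowledge_base: list[dict]) -> dict:
    """Run the full pipeline and return the answer plus the source ids used."""
    docs = retrieve(query, knowledge_base)
    answer = call_llm(augment(query, docs))
    return {"answer": answer, "sources": [doc["id"] for doc in docs]}
```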

Testing RAG Systems

Grounding accuracy tests whether responses stay grounded in the retrieved context rather than adding information from the LLM's training data. Does the system only make claims supported by the provided documents, or does it supplement them with potentially incorrect information? Retrieval quality evaluation assesses whether the retrieval step finds relevant documents. Are the most useful documents being retrieved and ranked highly? Citation accuracy verifies that when sources are cited, they actually support the claims being made. Do citations point to passages that genuinely contain the referenced information?
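
One way to automate the grounding check is to split the response into claims and test each claim against the retrieved passages. The sketch below uses a naive substring match purely for illustration; in practice `is_supported` would be an NLI model or an LLM-as-judge, and the sentence splitting would be more robust.

```python
# Grounding-check sketch: flag answer sentences with no support in the
# retrieved passages. The substring test is a stand-in for a real
# entailment check (NLI model or judge LLM).

def is_supported(claim: str, passages: list[str]) -> bool:
    """Naive placeholder: treat a claim as supported if it appears verbatim."""
    return any(claim.lower() in passage.lower() for passage in passages)

def grounding_violations(answer: str, passages: list[str]) -> list[str]:
    """Return the answer sentences that no retrieved passage supports."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return [s for s in sentences if not is_supported(s, passages)]

# In a test, assert the answer adds nothing beyond the retrieved context:
# assert grounding_violations(answer, retrieved_passages) == []
```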

Common RAG Testing Patterns

Testing context usage involves verifying the LLM actually uses retrieved information appropriately. Does it synthesize information from multiple retrieved documents? Does it ignore irrelevant retrieved content? Does it acknowledge when retrieved documents don't contain the answer? Testing for hallucination is particularly important for RAG: the system should admit when retrieved context doesn't answer the question rather than fabricating information. Create test cases where retrieval returns irrelevant documents to see if the LLM correctly identifies that it can't answer based on the provided context.
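
A probe like the one sketched below captures that last pattern: hand the system deliberately irrelevant documents and assert that it declines rather than invents an answer. The pipeline callable and the refusal phrases are assumptions; adapt them to however your system signals that it cannot answer.

```python
# Hallucination probe sketch: retrieval is forced to return off-topic
# documents, and the system is expected to decline rather than fabricate.
from typing import Callable

IRRELEVANT_DOCS = [
    {"id": "doc-1", "text": "The office coffee machine is descaled monthly."},
    {"id": "doc-2", "text": "Parking permits are issued by the front desk."},
]

# Assumed phrases your system uses to signal it cannot answer.
REFUSAL_MARKERS = ("don't have enough information", "cannot answer",
                   "not in the provided context")

def check_declines_on_irrelevant_context(
    rag_answer: Callable[[str, list[dict]], dict],
) -> None:
    """`rag_answer` is your pipeline entry point: (query, docs) -> {"answer": ...}."""
    result = rag_answer("What is the refund policy for enterprise plans?",
                        IRRELEVANT_DOCS)
    answer = result["answer"].lower()
    assert any(marker in answer for marker in REFUSAL_MARKERS), (
        f"Expected a refusal on irrelevant context, got: {result['answer']!r}"
    )
```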

RAG-Specific Challenges

Chunk boundaries create issues when relevant information spans multiple chunks or gets split awkwardly during document segmentation. Test how your system handles information that requires combining content from multiple passages. Conflicting information arises when different retrieved documents contradict each other. Does your system identify conflicts, attempt to resolve them based on source authority, or inappropriately blend contradictory claims? Context window limits constrain how much retrieved information can be provided to the LLM. When retrieval returns many relevant documents, how does your system select which to include? Does it prioritize effectively?
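
The context window challenge usually reduces to a selection policy: which ranked documents make it into the prompt. A minimal sketch, assuming documents arrive already sorted by relevance and using a crude characters-per-token estimate in place of a real tokenizer:

```python
# Context-selection sketch: greedily keep top-ranked documents until a
# token budget is exhausted. The 4-characters-per-token estimate and the
# default budget are rough assumptions; use your tokenizer's real counts.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude heuristic, not a real tokenizer

def select_context(ranked_docs: list[dict], budget_tokens: int = 3000) -> list[dict]:
    """Assumes ranked_docs is sorted most-relevant first."""
    selected, used = [], 0
    for doc in ranked_docs:
        cost = estimate_tokens(doc["text"])
        if used + cost > budget_tokens:
            break
        selected.append(doc)
        used += cost
    return selected
```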

Best Practices

For evaluating retrieval quality, assess relevance by checking whether retrieved documents actually answer the user's query. Evaluate coverage to ensure all necessary documents are retrieved, not just some. Verify ranking places the most relevant documents highest in results. Check diversity to confirm retrieval avoids returning redundant documents that all say the same thing.
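
Given an evaluation set where each query is labeled with its known-relevant document ids, the standard retrieval metrics for relevance, coverage, and ranking take only a few lines each. A sketch:

```python
# Retrieval-metric sketches over a labeled evaluation set: `retrieved_ids`
# is the ranked list the retriever returned for one query, `relevant_ids`
# the labeled ground truth for that query.

def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (relevance)."""
    top = retrieved_ids[:k]
    return sum(doc_id in relevant_ids for doc_id in top) / k if k else 0.0

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k (coverage)."""
    if not relevant_ids:
        return 0.0
    top = retrieved_ids[:k]
    return sum(doc_id in relevant_ids for doc_id in top) / len(relevant_ids)

def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1/rank of the first relevant document (ranking); 0 if none is found."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```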

For assessing generation quality, test grounding by verifying responses use only retrieved context without adding unsupported information. Evaluate completeness to ensure the response uses all relevant information from retrieved documents, not just the first passage. Check accuracy of extracted facts against source documents. Verify that citations are present where expected and point to passages that actually support the claims they accompany.
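
Citation checks are easy to automate at the structural level: every source id the answer cites should be one of the documents that were actually retrieved. The bracketed-id citation format in the sketch below is an assumption; adapt the pattern to your own convention, and pair it with a judge-based check that the cited passage really supports the claim.

```python
# Citation-consistency sketch: find cited ids in the answer (assumed format
# "[doc-3]") and flag any that the retriever never returned.
import re

def cited_ids(answer: str) -> set[str]:
    """Extract bracketed source ids from the answer text."""
    return set(re.findall(r"\[([\w-]+)\]", answer))

def dangling_citations(answer: str, retrieved_ids: set[str]) -> set[str]:
    """Citations that point at documents not present in the retrieved set."""
    return cited_ids(answer) - retrieved_ids

# Example check:
# assert dangling_citations(answer, {"doc-1", "doc-2"}) == set()
```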

For overall system evaluation, assess answer quality by determining whether final answers are correct and helpful to users. Test for hallucination by checking whether the system adds information not present in retrieved context. Enable source attribution so users can verify claims by checking cited sources. Examine failure modes by testing what happens with poor retrieval—does the system gracefully handle cases where retrieval doesn't find relevant information?
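
At the system level these checks combine into an evaluation loop over a labeled case set. The sketch below assumes two hypothetical callables supplied by your system: the pipeline entry point and a grader (human review or LLM-as-judge) that returns per-case pass/fail flags for grounding, correctness, and attribution.

```python
# End-to-end evaluation loop sketch. `rag_answer` and `grade_answer` are
# hypothetical callables: the pipeline entry point and a grader returning
# boolean flags for each case.
from typing import Callable

def evaluate(cases: list[dict],
             rag_answer: Callable[[str], dict],
             grade_answer: Callable[[dict, dict], dict]) -> dict:
    """Each case has a "query" plus whatever ground truth the grader needs."""
    grades = [grade_answer(case, rag_answer(case["query"])) for case in cases]
    total = len(grades) or 1
    return {
        "grounded":   sum(g["grounded"] for g in grades) / total,    # no unsupported info
        "correct":    sum(g["correct"] for g in grades) / total,     # answer quality
        "attributed": sum(g["attributed"] for g in grades) / total,  # sources cited
    }
```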
