Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) enhances language model responses by retrieving relevant documents from an external knowledge base and providing them as context. It grounds the model’s output in specific, up-to-date, or private information.
Why RAG?
LLM limitations RAG addresses:
- Knowledge cutoff: LLMs don’t know about events that occurred after their training data was collected.
- Hallucination: grounding in retrieved text reduces fabrication.
- Private data: company documents and customer records are not in pretraining data.
- Source attribution: retrieved passages can be cited.
- Cost: cheaper to update a knowledge base than to retrain or fine-tune a model.
Basic RAG Pipeline
User query
→ Query encoder (embed query)
→ Retrieve top-k documents from vector store
→ Augment prompt: [system] + [retrieved docs] + [user query]
→ LLM generates answer
→ Return answer (optionally with source citations)
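The augmentation step of this pipeline can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: `build_prompt`, `answer`, and the `retrieve` and `generate` callables are hypothetical names standing in for whatever vector store and LLM client are actually used, and the prompt layout is one assumed choice among many.

```python
from typing import Callable, List, Tuple


def build_prompt(system: str, docs: List[str], query: str) -> str:
    """Assemble [system] + [retrieved docs] + [user query] into one prompt."""
    # Number the passages so the model can cite them as [1], [2], ...
    numbered = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, start=1))
    return f"{system}\n\nContext:\n{numbered}\n\nQuestion: {query}"


def answer(
    query: str,
    retrieve: Callable[[str, int], List[str]],  # hypothetical: (query, k) -> passages
    generate: Callable[[str], str],             # hypothetical: prompt -> completion
    k: int = 3,
    system: str = "Answer using only the context. Cite sources as [n].",
) -> Tuple[str, List[str]]:
    """Run retrieve -> augment -> generate and return the answer with its sources."""
    docs = retrieve(query, k)
    prompt = build_prompt(system, docs, query)
    return generate(prompt), docs
```

Returning the retrieved passages alongside the answer keeps source citation possible without a second lookup.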
Indexing Phase (Offline)
Document collection: gather all knowledge base documents (PDFs, web pages, internal wikis, databases).
Chunking: split documents into chunks of 200-1000 tokens. Chunk size tradeoff: larger chunks provide more context but are less precise; smaller chunks are more targeted but may miss context.
Chunk strategies:
- Fixed token size with overlap.
- Sentence boundary splitting.
- Semantic chunking: split at topic shift boundaries.
- Hierarchical: store both sentence-level and paragraph-level chunks.
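The first strategy, fixed token size with overlap, can be sketched as follows. The function name `chunk_tokens` and its default sizes are illustrative assumptions, and the input is assumed to be already tokenized (a real pipeline would use the embedding model's own tokenizer rather than, say, whitespace splitting).

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 200, overlap: int = 50) -> List[List[str]]:
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk reaches the end, to avoid a trailing fragment
        # that would be fully contained in the previous chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of storing some tokens twice.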
Encoding: embed each chunk with a text embedding model.
Vector store: index embeddings for approximate nearest neighbor (ANN) search. Common options include FAISS, Pinecone, Weaviate, Qdrant, and pgvector.
Retrieval Phase
Dense retrieval: embed the query; find top-$k$ chunks by cosine similarity.
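A brute-force version of dense retrieval makes the scoring explicit. This is a sketch under simple assumptions: vectors are plain Python lists, the names `cosine` and `top_k` are illustrative, and every chunk is scored exactly, whereas a real vector store would use an ANN index instead.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float], chunk_vecs: Sequence[Sequence[float]], k: int) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) for the k most similar chunks."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```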
Sparse retrieval (BM25): keyword-based scoring that extends TF-IDF with term-frequency saturation and document-length normalization. Fast; no embeddings needed. Good for exact keyword matches.
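For reference, a compact BM25 scorer might look like the sketch below. It assumes pre-tokenized documents, uses the non-negative IDF variant $\ln\left(\frac{N - n_t + 0.5}{n_t + 0.5} + 1\right)$, and picks $k_1 = 1.5$, $b = 0.75$ as illustrative defaults; production systems usually rely on a search engine's built-in implementation, and `bm25_scores` is a hypothetical name.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the tokenized query with BM25."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Saturating term frequency with length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```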
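Hybrid retrieval: combine the dense and sparse result lists. Reciprocal Rank Fusion (RRF) is a common combination method; it uses only each document's rank in each list, so the two scoring scales do not need to be normalized.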
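\[\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}\]
where $R$ is the set of rankers, $r(d)$ is the rank of document $d$ in ranker $r$, and $k$ is a smoothing constant (typically $k = 60$), distinct from the top-$k$ retrieval cutoff.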
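The RRF formula translates directly into code. In this sketch the function name `rrf` is illustrative, ranks are assumed to start at 1, and a document missing from a ranker's list simply contributes nothing for that ranker (one common convention, assumed here).

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

For example, fusing a dense ranking `["a", "b", "c"]` with a sparse ranking `["b", "d", "a"]` places `b` first, because it ranks highly in both lists.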
Reranking: after retrieving the top-$k$ candidates (e.g., $k=20$), re-score them with a cross-encoder, which sees the query and document jointly, and pass only the top-$m$ ($m < k$) to the LLM context. Cross-encoders are typically much more accurate than embedding similarity but too slow to run over the full corpus.
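The two-stage shape of reranking can be sketched independently of any particular model. Here `score_pair` is a hypothetical callable standing in for a cross-encoder's joint (query, document) score, and `overlap_score` is only a toy substitute so the example runs without a model.

```python
from typing import Callable, List


def rerank(query: str, candidates: List[str], score_pair: Callable[[str, str], float], m: int) -> List[str]:
    """Re-score first-stage candidates jointly with the query; keep the best m."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:m]]


def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0
```

Because the scorer only ever sees the $k$ first-stage candidates, its cost scales with $k$ rather than with corpus size.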
Embedding Models
| Model | Dimension | Notes |
|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Strong; proprietary |
| Cohere embed-v3 | 1024 | Multilingual; strong |
| BGE-large-en | 1024 | Strong open-source |
| E5-mistral-7b | 4096 | LLM-based embedding; among the strongest at release |
| Nomic-embed-text | 768 | Open; context-length flexible |
Matryoshka Representation Learning (MRL): train embeddings so that the first $d'$ dimensions also form a useful embedding for any $d' < d$. This allows embeddings to be truncated to fewer dimensions after the fact, trading a modest quality loss for lower storage and faster search. Used in OpenAI text-embedding-3 models.
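Using an MRL embedding at a reduced size amounts to keeping a prefix and re-normalizing. A minimal sketch, with `truncate_embedding` as an illustrative name; it is only meaningful for models actually trained with MRL, since truncating an ordinary embedding this way would generally discard information arbitrarily.

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` dimensions and rescale to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    # Re-normalize so cosine and dot-product similarity agree after truncation.
    return [x / norm for x in head] if norm else head
```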
Advanced RAG Patterns
Query Transformation
HyDE (Hypothetical Document Embeddings): generate a hypothetical answer to the query; embed it; retrieve documents similar to the hypothetical answer. Often improves recall because the hypothetical answer is closer to actual document content than the raw question.
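The HyDE flow is short enough to sketch directly. The callables `generate`, `embed`, and `search` are hypothetical stand-ins for an LLM client, an embedding model, and a vector store, and the prompt wording is an assumption.

```python
from typing import Callable, List, Sequence


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],                    # hypothetical: prompt -> text
    embed: Callable[[str], Sequence[float]],           # hypothetical: text -> vector
    search: Callable[[Sequence[float], int], List[str]],  # hypothetical: (vector, k) -> chunk ids
    k: int = 5,
) -> List[str]:
    """Retrieve with the embedding of a generated answer instead of the raw query."""
    hypothetical = generate(f"Write a short passage that answers the question:\n{query}")
    # The hypothetical passage may contain errors; it is used only as a search key.
    return search(embed(hypothetical), k)
```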
Multi-query: generate $n$ paraphrases of the query; retrieve for each; union the results. Improves recall at the cost of extra LLM calls.
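A sketch of the multi-query union, keeping first-seen order and dropping duplicates. `paraphrase` and `retrieve` are hypothetical callables; including the original query alongside its paraphrases is a design choice made here, not something the method requires.

```python
from typing import Callable, List


def multi_query_retrieve(
    query: str,
    paraphrase: Callable[[str, int], List[str]],  # hypothetical: (query, n) -> n rewrites
    retrieve: Callable[[str, int], List[str]],    # hypothetical: (query, k) -> chunk ids
    n: int = 3,
    k: int = 5,
) -> List[str]:
    """Retrieve for the query and n paraphrases; return the de-duplicated union."""
    queries = [query] + paraphrase(query, n)
    seen = set()
    merged: List[str] = []
    for q in queries:
        for chunk_id in retrieve(q, k):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged
```

The union can contain up to $(n + 1) \cdot k$ chunks, so it is often followed by a fusion or reranking step before prompting.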
Step-back prompting: reformulate a specific question into a more general one; retrieve high-level background context; then answer the original question with the context.
Contextual Retrieval (Anthropic 2024)
Prepend a context summary to each chunk before embedding:
Context: This chunk is from a Q3 2024 earnings report for Acme Corp, describing operating margins.
[Original chunk text...]
Anthropic reports substantially fewer retrieval failures with this approach, particularly for chunks that are ambiguous out of context.
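The indexing-time change is small: each chunk is embedded together with its context line. In this sketch `summarize` is a hypothetical callable (typically an LLM call that sees the whole document and the chunk), and the `Context:` prefix format follows the example above.

```python
from typing import Callable, List


def contextualize_chunks(
    document: str,
    chunks: List[str],
    summarize: Callable[[str, str], str],  # hypothetical: (document, chunk) -> context line
) -> List[str]:
    """Prepend a document-aware context line to each chunk before embedding."""
    return [f"Context: {summarize(document, chunk)}\n{chunk}" for chunk in chunks]
```

The contextualized strings are what get embedded and indexed; the original chunk text can still be stored separately for display.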
Self-RAG
The model decides when to retrieve (generates a retrieval token), critiques retrieved documents (relevance token), and critiques its own response (support and utility tokens). More accurate but requires a specially fine-tuned model.
RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval)
Build a tree of summaries. Cluster leaf chunks; summarize each cluster; cluster the summaries; repeat. At retrieval, query both leaf and summary levels. Captures both local detail and global structure.
RAG Evaluation
Retrieval metrics: context precision (fraction of retrieved chunks that are relevant), context recall (fraction of relevant chunks retrieved).
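Given ground-truth relevance labels, both retrieval metrics reduce to set arithmetic. The sketch below implements the plain definitions stated above; the function names are illustrative, and evaluation frameworks may define these metrics differently (for example, using an LLM judge or taking rank into account).

```python
from typing import Iterable, List


def context_precision(retrieved: List[str], relevant: Iterable[str]) -> float:
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    relevant_set = set(relevant)
    return sum(1 for chunk_id in retrieved if chunk_id in relevant_set) / len(retrieved)


def context_recall(retrieved: List[str], relevant: Iterable[str]) -> float:
    """Fraction of relevant chunks that were retrieved."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(relevant_set & set(retrieved)) / len(relevant_set)
```

With retrieved chunks `["a", "b", "c", "d"]` and relevant chunks `["a", "c", "e"]`, precision is 2/4 and recall is 2/3.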
Generation metrics: faithfulness (does the answer follow from the context?), answer relevance (does the answer address the question?).
RAGAS: framework that computes faithfulness, answer relevance, context precision, and context recall using an LLM as a judge.
RAG vs. Fine-tuning
| Aspect | RAG | Fine-tuning |
|---|---|---|
| Knowledge update | Instant (update the index) | Requires retraining |
| Factual accuracy | Usually higher when retrieval succeeds (grounded) | Relies on memorized facts; can hallucinate |
| Private data | Natural fit | Training data privacy concerns |
| Reasoning style | Unchanged | Can be adapted |
| Cost | Retrieval + inference | Training + inference |
| Latency | Higher (retrieval adds delay) | Lower |
In practice: use RAG for dynamic, factual, or private knowledge; use fine-tuning for consistent output style, specialized vocabulary, or task-specific behavior.