🤖 RAG Pattern — Complete Guide
Retrieval-Augmented Generation: Giving LLMs a reliable memory and reducing hallucinations.
NOTE
Prerequisite: This guide assumes you understand Vector Databases and Embeddings. Read those guides first if you haven't!
What is RAG?
RAG (Retrieval-Augmented Generation) is an architecture that connects a Large Language Model (LLM) to a private or external database.
Instead of relying on the LLM's internal memory (which may be outdated or produce hallucinations), RAG searches your database for relevant facts first, and then instructs the LLM to base its answer on those facts.
The Problem RAG Solves
Standard LLMs (like GPT-4 or Claude) have three major flaws:
- Knowledge Cutoff: They don't know about events that happened after they were trained.
- Private Data: They don't know your company's proprietary documents, user data, or current inventory.
- Hallucinations: If they don't know the answer, they will often confidently invent a fake one.
RAG addresses all three by treating the LLM not as a database of facts, but as a reasoning engine that processes the facts you give it.
🏗️ The RAG Architecture
A standard RAG pipeline has two distinct phases: Data Ingestion (done ahead of time) and Retrieval & Generation (done at query time).
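The ingestion phase can be sketched in a few lines. This is a minimal sketch, not a production pipeline: `embed` is an assumed async function that wraps whatever embedding model you use, a plain array stands in for the vector database, and `splitIntoChunks`, `ingestDocument`, and the record shape are illustrative names rather than part of any library.

```javascript
// Split text into chunks of at most `maxWords` words (hypothetical helper).
function splitIntoChunks(text, maxWords = 200) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}

// Ingest one document: chunk it, embed each chunk, and store the vector
// together with the original text so it can be returned at query time.
// ASSUMPTION: `embed` is an async function (text) => Promise<number[]>.
// ASSUMPTION: `store` is a plain array standing in for a vector database.
async function ingestDocument(docId, text, embed, store, maxWords = 200) {
  const chunks = splitIntoChunks(text, maxWords);
  for (let i = 0; i < chunks.length; i++) {
    const vector = await embed(chunks[i]);
    store.push({ id: `${docId}#${i}`, vector, payload: { text: chunks[i] } });
  }
  return chunks.length; // number of chunks stored
}
```

Storing the original text next to each vector is what lets the query phase hand readable chunks to the LLM rather than raw numbers.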
🔄 How It Works: Step by Step
Let's look at what actually happens when a user asks: "What is the company's refund policy?"
1. The User Asks a Question
The user submits: What is the company's refund policy?
2. Embed the Question
The question is sent to an embedding model (like text-embedding-3-small) to turn it into a vector: [0.12, 0.44, 0.81...].
3. Retrieve Context
The vector database searches for the closest matching vectors and returns the original text chunks:
Result 1 (Score 0.95): "Refunds are allowed within 30 days of purchase..."
Result 2 (Score 0.88): "Return shipping is paid by the customer..."
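The scores above measure how close each stored vector is to the question vector. Cosine similarity is one common choice of metric. The sketch below shows a brute-force version of the search, assuming records shaped like `{ vector, payload: { text } }` (an illustrative shape, not any specific database's format).

```javascript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid division by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every stored record against the query and return the best `limit`.
// ASSUMPTION: `records` is an array of { vector, payload: { text } }.
function searchTopK(queryVector, records, limit = 3) {
  return records
    .map((r) => ({
      score: cosineSimilarity(queryVector, r.vector),
      text: r.payload.text,
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}
```

Real vector databases avoid comparing against every record by using approximate nearest-neighbor indexes, but the scoring idea is the same.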
4. Augment the Prompt
The application takes the user's question AND the retrieved text, and combines them into a strict prompt:
System: You are a helpful assistant. Answer the user's question based ONLY on the provided context. If the answer is not in the context, say "I don't know".
Context:
1. Refunds are allowed within 30 days of purchase.
2. Return shipping is paid by the customer.
Question: What is the company's refund policy?
5. Generation
The LLM reads the combined prompt and generates a response grounded in the retrieved context: "The company allows refunds within 30 days of purchase, but you must pay for return shipping."
💻 Full Code Example (Node.js)
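Here is a complete query-time RAG implementation using the OpenAI Node.js SDK and a generic vector DB client. The `getVectorDbClient` helper, the `db.search` call, and the `result.payload.text` shape are placeholders: swap in the equivalents from whichever vector database you use.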
import { OpenAI } from "openai";
import { getVectorDbClient } from "./my-db"; // e.g., Pinecone, Qdrant

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const db = getVectorDbClient();

async function answerWithRAG(userQuestion) {
  // 1. Embed the user's question
  const embedRes = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: userQuestion,
  });
  const queryVector = embedRes.data[0].embedding;

  // 2. Retrieve relevant context from Vector DB
  const searchResults = await db.search({
    vector: queryVector,
    limit: 3, // Top 3 most relevant chunks
  });

  // Combine retrieved text into a single string
  const contextText = searchResults
    .map((result) => result.payload.text)
    .join("\n\n---\n\n");

  // 3. Construct prompt and call LLM
  const chatRes = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `You are an expert assistant.
Answer the user's question using ONLY the provided context.
If the context does not contain the answer, say "I don't have enough information."
Do not use outside knowledge.`,
      },
      {
        role: "user",
        content: `Context:\n${contextText}\n\nQuestion: ${userQuestion}`,
      },
    ],
  });

  // 4. Return the grounded answer
  return chatRes.choices[0].message.content;
}

// Usage:
// const answer = await answerWithRAG("How do I reset my API key?");

🚀 Advanced RAG Techniques
Basic RAG (shown above) works well for simple apps but often falls short in complex scenarios. To build production-grade RAG, you need more advanced techniques.
1. Hybrid Search (Keyword + Semantic)
Semantic search (vectors) is weak at exact matches (like names, IDs, or acronyms). Hybrid search runs vector search alongside traditional keyword search (for example BM25, the default ranking function in Elasticsearch) and merges the two result lists.
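The text above does not say how the two result lists are merged. One widely used option is Reciprocal Rank Fusion (RRF), which needs only each document's rank in each list, so the vector and keyword scores never have to share a scale. The sketch below assumes each search returns an array of document IDs ordered best-first; the function name and the sample IDs are illustrative.

```javascript
// Merge several ranked lists of document IDs with Reciprocal Rank Fusion.
// Each document earns 1 / (k + rank) from every list it appears in.
// k = 60 is a commonly used constant; treat it as a tunable assumption.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

// Usage with made-up IDs: documents found by both searches rise to the top.
const vectorHits = ["doc-a", "doc-b", "doc-c"];
const keywordHits = ["doc-c", "doc-a", "doc-d"];
const fused = reciprocalRankFusion([vectorHits, keywordHits]);
console.log(fused.map((r) => r.id));
```

Documents that appear in both lists accumulate two contributions, which is why an exact keyword match that is also semantically close tends to win.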
2. Reranking (Cross-Encoders)
Vector search relies on bi-encoder embedding models, which embed the query and each document separately. This is fast but not always precise. A common fix is to retrieve the top 20 results from the vector DB, pass them through a cross-encoder reranker (like Cohere Rerank) that scores each query-document pair together, and give only the 5 highest-scoring results to the LLM.
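The two-stage pattern can be sketched independently of any particular reranker. Here `scorePair` is an assumed async function standing in for a cross-encoder model or a hosted rerank API, and the candidate shape `{ id, text }` is illustrative.

```javascript
// Second-stage reranking: re-score each first-stage candidate against the
// query, then keep only the best `topN`.
// ASSUMPTION: `scorePair` is an async function (query, text) => Promise<number>
// where a higher number means more relevant.
async function rerank(query, candidates, scorePair, topN = 5) {
  const scored = await Promise.all(
    candidates.map(async (candidate) => ({
      ...candidate,
      rerankScore: await scorePair(query, candidate.text),
    }))
  );
  return scored
    .sort((a, b) => b.rerankScore - a.rerankScore)
    .slice(0, topN);
}
```

Because the reranker only sees the 20 or so first-stage candidates, its higher per-pair cost stays bounded no matter how large the collection is.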
3. Query Expansion / Rewriting
Users often ask vague or ambiguous questions. You can use a fast LLM to rewrite the user's query into a better search query before hitting the vector database, typically drawing on the conversation history to fill in missing details.
User: "Why does it crash?"
Rewritten Query: "Why does the application crash on startup in production?"
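A rewriting step might look like the sketch below. The prompt wording is only one possibility, and `complete` is an assumed async function that sends a prompt to whichever fast LLM you choose and returns its text reply.

```javascript
// Build the instruction sent to the rewriting model.
// `history` is an array of earlier conversation turns as plain strings.
function buildRewritePrompt(question, history = []) {
  return [
    "Rewrite the user's question as a standalone search query.",
    "Resolve pronouns and vague references using the conversation history.",
    "Return only the rewritten query.",
    "",
    "Conversation history:",
    ...history.map((turn) => `- ${turn}`),
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Rewrite the question, falling back to the original if the model
// returns an empty reply.
// ASSUMPTION: `complete` is an async function (prompt) => Promise<string>.
async function rewriteQuery(question, history, complete) {
  const reply = await complete(buildRewritePrompt(question, history));
  const rewritten = reply.trim();
  return rewritten.length > 0 ? rewritten : question;
}
```

The fallback matters because a failed rewrite should degrade to ordinary retrieval rather than an empty search.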
4. Document Chunking Strategies
How you split your documents drastically affects RAG quality.
- Fixed Size: Split every 500 words. (Simple, but might cut a sentence in half).
- Semantic Split: Split by markdown headers or paragraphs. (Better context).
- Parent-Child (Hierarchical): Embed small 100-word chunks for highly accurate search, but when a match is found, pass the entire 1000-word parent document to the LLM to give it full context.
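The parent-child strategy can be sketched as follows. Chunk sizes are counted in words to match the list above; `splitWords`, `buildParentChildChunks`, and `expandToParents` are illustrative names, and a real pipeline would embed each child's `text` and store its `parentId` as metadata.

```javascript
// Split text into consecutive chunks of at most `size` words.
function splitWords(text, size) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += size) {
    chunks.push(words.slice(i, i + size).join(" "));
  }
  return chunks;
}

// Build large parent chunks, then small child chunks that remember
// which parent they came from.
function buildParentChildChunks(text, parentSize = 1000, childSize = 100) {
  const parents = splitWords(text, parentSize);
  const children = [];
  parents.forEach((parentText, parentId) => {
    for (const childText of splitWords(parentText, childSize)) {
      children.push({ text: childText, parentId });
    }
  });
  return { parents, children };
}

// After the search matches some children, return each distinct parent once,
// in the order the matches arrived.
function expandToParents(matchedChildren, parents) {
  const parentIds = [...new Set(matchedChildren.map((c) => c.parentId))];
  return parentIds.map((id) => parents[id]);
}
```

Deduplicating the parents keeps the prompt from repeating the same passage when several children of one parent match.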
⚖️ RAG vs. Fine-Tuning vs. Long Context
When should you use RAG versus other methods of giving an LLM knowledge?
| Feature | RAG (Retrieval) | Fine-Tuning | Long Context (e.g., Gemini 1.5 Pro) |
|---|---|---|---|
| Best For | Fact retrieval, specific docs | Changing tone, format, style | Analyzing a single huge document |
| Updating Data | Instant (just add to DB) | Slow (retrain model) | Instant (just paste it in) |
| Cost | Low (cheap queries) | High (training costs) | High (massive prompt tokens) |
| Hallucinations | Lower (answers grounded in retrieved text) | Not reduced (can still hallucinate) | Lower (the full document is in the prompt) |
Industry Mantra: "Fine-tune for form, RAG for facts." Do not use fine-tuning to teach an LLM new information; use RAG.
🛡️ RAG Evaluation (The RAG Triad)
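How do you know if your RAG system is actually good? Frameworks like TruLens and Ragas evaluate RAG along three dimensions, which TruLens calls the RAG Triad (Ragas covers similar ground under names such as faithfulness and answer relevancy):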
- Context Relevance: Did the Vector DB return the right documents? (Or did it return useless junk?)
- Groundedness: Did the LLM base its answer only on the retrieved context? (Or did it hallucinate?)
- Answer Relevance: Did the final answer actually solve the user's question?
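The frameworks above typically score these metrics with an LLM acting as judge. A much simpler check for the retrieval side, if you have a small hand-labelled test set, is hit rate at k. The sketch below is that simpler proxy, not how TruLens or Ragas compute context relevance, and the test-case shape is an assumption.

```javascript
// Hit rate at k: the fraction of test questions for which at least one
// expected chunk ID appears among the top-k retrieved IDs.
// ASSUMPTION: each test case is { retrievedIds: string[], expectedIds: string[] }
// with `retrievedIds` ordered best-first.
function hitRateAtK(testCases, k = 3) {
  if (testCases.length === 0) return 0;
  const hits = testCases.filter(({ retrievedIds, expectedIds }) =>
    retrievedIds.slice(0, k).some((id) => expectedIds.includes(id))
  ).length;
  return hits / testCases.length;
}
```

A falling hit rate after a chunking or embedding change points to a retrieval problem before any LLM-based evaluation is needed.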
✅ Checklist Before Moving On
- [ ] I understand the difference between the LLM and the Vector DB in RAG.
- [ ] I can draw the architecture of the Ingestion phase vs the Query phase.
- [ ] I know why we put the retrieved context into the prompt.
- [ ] I understand why RAG is better than Fine-Tuning for adding facts.
- [ ] I am familiar with advanced concepts like Hybrid Search and Reranking.
📚 Further Reading
- Building RAG Apps (LangChain) — Great practical tutorial.
- Advanced RAG Techniques (Pinecone) — Deep dive into chunking, reranking, and hybrid search.
- RAG vs Fine Tuning — OpenAI's perspective on when to use which.
➡️ Next: Level 4 — Caching
