🤖 RAG Pattern — Complete Guide

Retrieval-Augmented Generation: Giving LLMs a reliable memory and reducing hallucinations.

NOTE

Prerequisite: This guide assumes you understand Vector Databases and Embeddings. Read those guides first if you haven't!

What is RAG?

RAG (Retrieval-Augmented Generation) is an architecture that connects a Large Language Model (LLM) to a private or external database.

Instead of relying on the LLM's internal memory (which may be outdated or produce hallucinations), RAG first searches your database for relevant facts, then instructs the LLM to base its answer strictly on those facts.

The Problem RAG Solves

Standard LLMs (like GPT-4 or Claude) have three major flaws:

Knowledge Cutoff: They don't know about events that happened after they were trained.
Private Data: They don't know your company's proprietary documents, user data, or current inventory.
Hallucinations: If they don't know the answer, they will often confidently invent a fake one.

RAG addresses all three by treating the LLM not as a database of facts, but as a reasoning engine that processes the facts you give it.

🏗️ The RAG Architecture

A standard RAG pipeline has two distinct phases: Data Ingestion (done ahead of time) and Retrieval & Generation (done at query time).
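
The ingestion phase can be sketched roughly as follows: split each document into chunks, embed each chunk, and store the vector together with the original text. This is a minimal sketch under stated assumptions, not a definitive implementation. `embed(text)` and `upsert(records)` are hypothetical injected functions standing in for your embedding model and vector DB client, and the record shape (`id`, `vector`, `payload.text`) is an assumption that your database may not share.

```javascript
// Ingestion sketch: split documents into chunks, embed each chunk, store it.
// `embed` and `upsert` are hypothetical, injected dependencies (assumptions).
async function ingestDocuments(docs, { embed, upsert, chunkSize = 500 }) {
  const records = [];
  for (const doc of docs) {
    // Naive fixed-size split by characters; real pipelines often split smarter.
    for (let start = 0, n = 0; start < doc.text.length; start += chunkSize, n++) {
      const text = doc.text.slice(start, start + chunkSize);
      records.push({
        id: `${doc.id}-${n}`,
        vector: await embed(text),
        payload: { text, source: doc.id },
      });
    }
  }
  await upsert(records);
  return records.length; // number of chunks stored
}
```

Passing `embed` and `upsert` in as arguments keeps the sketch independent of any particular provider and makes it easy to exercise with fakes.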

🔄 How It Works: Step by Step

Let's look at what actually happens when a user asks: "What is the company's refund policy?"

1. The User Asks a Question

The user submits: What is the company's refund policy?

2. Embed the Question

The question is sent to an embedding model (like text-embedding-3-small) to turn it into a vector: [0.12, 0.44, 0.81...].

3. Retrieve Context

The vector database searches for the closest matching vectors and returns the original text chunks:

Result 1 (Score 0.95): "Refunds are allowed within 30 days of purchase..."
Result 2 (Score 0.88): "Return shipping is paid by the customer..."
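
To make "closest matching vectors" concrete, here is a brute-force sketch of what a vector database does conceptually: score every stored chunk against the query vector with cosine similarity and keep the best matches. Real vector databases typically use approximate nearest-neighbor indexes rather than a full scan, and the `topK` function and `{ text, vector }` record shape are illustrative assumptions.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|). Ranges from -1 to 1.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Brute-force top-k search over an in-memory array of { text, vector } records.
function topK(queryVector, records, k = 3) {
  return records
    .map((r) => ({ text: r.text, score: cosineSimilarity(queryVector, r.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```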

4. Augment the Prompt

The application takes the user's question AND the retrieved text, and combines them into a strict prompt:

text

System: You are a helpful assistant. Answer the user's question based ONLY on the provided context. If the answer is not in the context, say "I don't know".

Context:
1. Refunds are allowed within 30 days of purchase.
2. Return shipping is paid by the customer.

Question: What is the company's refund policy?

5. Generation

The LLM reads the combined prompt and generates a response grounded in the retrieved context: "The company allows refunds within 30 days of purchase, but you must pay for return shipping."

💻 Full Code Example (Node.js)

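Here is a minimal RAG implementation using the OpenAI Node.js SDK and a generic vector DB client. The ./my-db module and its search() method are placeholders for your own client (e.g., Pinecone or Qdrant), so adapt them to your database's API.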

javascript

import { OpenAI } from "openai";
import { getVectorDbClient } from "./my-db"; // e.g., Pinecone, Qdrant

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const db = getVectorDbClient();

async function answerWithRAG(userQuestion) {
  // 1. Embed the user's question
  const embedRes = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: userQuestion,
  });
  const queryVector = embedRes.data[0].embedding;

  // 2. Retrieve relevant context from Vector DB
  const searchResults = await db.search({
    vector: queryVector,
    limit: 3, // Top 3 most relevant chunks
  });

  // Combine retrieved text into a single string
  const contextText = searchResults
    .map((result) => result.payload.text)
    .join("\n\n---\n\n");

  // 3. Construct prompt and call LLM
  const chatRes = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `You are an expert assistant. 
Answer the user's question using ONLY the provided context. 
If the context does not contain the answer, say "I don't have enough information."
Do not use outside knowledge.`,
      },
      {
        role: "user",
        content: `Context:\n${contextText}\n\nQuestion: ${userQuestion}`,
      },
    ],
  });

  // 4. Return the grounded answer
  return chatRes.choices[0].message.content;
}

// Usage:
// const answer = await answerWithRAG("How do I reset my API key?");

🚀 Advanced RAG Techniques

Basic RAG (shown above) works well for simple apps but often falls short in complex scenarios, such as exact-match lookups, ambiguous questions, or long documents. Production-grade RAG usually needs some of the techniques below.

1. Hybrid Search (Keyword + Semantic)

Semantic (vector) search is weak at exact matches such as names, IDs, or acronyms. Hybrid search runs vector search alongside traditional keyword search (e.g., BM25, as used by Elasticsearch) and merges the two result lists.
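
One common way to merge the two result lists is Reciprocal Rank Fusion (RRF), which needs only each document's rank in each list, not comparable scores. The sketch below is one option among several, not the only way to do hybrid search. The function name, the ID-array input shape, and the constant `k = 60` (a commonly used default) are assumptions for illustration.

```javascript
// Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank),
// where rank starts at 1. Input: arrays of document IDs, best match first.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

// Usage: doc "b" appears in both lists, so it ends up ranked first.
const vectorHits = ["a", "b", "c"];
const keywordHits = ["b", "d"];
const fused = reciprocalRankFusion([vectorHits, keywordHits]);
```

Because RRF ignores raw scores, it sidesteps the problem that cosine similarities and BM25 scores live on different scales.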

2. Reranking (Cross-Encoders)

The embedding models behind vector DBs are "Bi-encoders": they embed the query and each document separately, which is fast but less precise. A common pattern is to retrieve the top 20 results from the Vector DB, then pass them through a Cross-Encoder (like Cohere Rerank), which scores each query-document pair together, and give only the re-ordered top 5 results to the LLM.
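
The retrieve-then-rerank flow can be sketched as below. The cross-encoder itself is represented by `scorePair(query, text)`, a hypothetical injected function that you would back with a reranking model or API. Nothing here reflects a specific vendor's SDK, and the `{ text }` candidate shape is an assumption.

```javascript
// Two-stage retrieval: take a wide candidate set, re-score each candidate
// against the query, and keep only the best few for the prompt.
// `scorePair` is a hypothetical stand-in for a cross-encoder call (assumption).
async function rerank(query, candidates, scorePair, topN = 5) {
  const scored = await Promise.all(
    candidates.map(async (candidate) => ({
      ...candidate,
      rerankScore: await scorePair(query, candidate.text),
    }))
  );
  return scored.sort((a, b) => b.rerankScore - a.rerankScore).slice(0, topN);
}
```

In practice you would call this with the 20 or so results from the vector search and pass the returned top few chunks into the prompt.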

3. Query Expansion / Rewriting

Users often ask vague or ambiguous questions. You can use a fast LLM to rewrite the user's query into a better search query (typically drawing on the conversation history for context) before hitting the vector database.

User: "Why does it crash?"
Rewritten Query: "Why does the application crash on startup in production?"

4. Document Chunking Strategies

How you split your documents drastically affects RAG quality.

Fixed Size: Split every 500 words. (Simple, but might cut a sentence in half).
Semantic Split: Split by markdown headers or paragraphs. (Better context).
Parent-Child (Hierarchical): Embed small 100-word chunks for highly accurate search, but when a match is found, pass the entire 1000-word parent document to the LLM to give it full context.
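
As a rough sketch of the fixed-size and parent-child strategies, the code below splits text into word-based chunks with an optional overlap, then builds small child chunks that remember their parent document. Overlap is a common addition not mentioned above: it lets a sentence cut at one boundary still appear whole in the neighboring chunk. The function names, defaults, and record shapes are illustrative assumptions.

```javascript
// Fixed-size chunking by word count, with overlap between neighboring chunks.
function chunkByWords(text, chunkSize = 500, overlap = 50) {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last chunk reached the end
  }
  return chunks;
}

// Parent-child: index small chunks, but remember which parent each came from.
function buildParentChildIndex(parents, childSize = 100) {
  return parents.flatMap((parent) =>
    chunkByWords(parent.text, childSize, 0).map((text, i) => ({
      id: `${parent.id}#${i}`,
      parentId: parent.id, // look up the full parent text after a match
      text,
    }))
  );
}
```

At query time you would search over the child chunks, then use `parentId` to fetch the full parent text for the prompt.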

⚖️ RAG vs. Fine-Tuning vs. Long Context

When should you use RAG versus other methods of giving an LLM knowledge?

Feature	RAG (Retrieval)	Fine-Tuning	Long Context (e.g., Gemini 1.5 Pro)
Best For	Fact retrieval, specific docs	Changing tone, format, style	Analyzing a single huge document
Updating Data	Instant (just add to DB)	Slow (retrain model)	Instant (just paste it in)
Cost	Low (cheap queries)	High (training costs)	High (massive prompt tokens)
Hallucinations	Lower (grounded in retrieved text)	Higher (no source text to ground on)	Low (reads the whole doc)

Industry Mantra: "Fine-tune for form, RAG for facts." Do not use fine-tuning to teach an LLM new information; use RAG.

🛡️ RAG Evaluation (The RAG Triad)

How do you know if your RAG system is actually good? Frameworks like TruLens and Ragas evaluate RAG along three dimensions (TruLens calls them the RAG Triad; Ragas uses somewhat different metric names):

Context Relevance: Did the Vector DB return the right documents? (Or did it return useless junk?)
Groundedness: Did the LLM base its answer only on the retrieved context? (Or did it hallucinate?)
Answer Relevance: Did the final answer actually solve the user's question?

✅ Checklist Before Moving On

[ ] I understand the difference between the LLM and the Vector DB in RAG.
[ ] I can draw the architecture of the Ingestion phase vs the Query phase.
[ ] I know why we put the retrieved context into the prompt alongside the user's question.
[ ] I understand why RAG is better than Fine-Tuning for adding facts.
[ ] I am familiar with advanced concepts like Hybrid Search and Reranking.

📚 Further Reading

Building RAG Apps (LangChain) — Great practical tutorial.
Advanced RAG Techniques (Pinecone) — Deep dive into chunking, reranking, and hybrid search.
RAG vs Fine Tuning — OpenAI's perspective on when to use which.

➡️ Next: Level 4 — Caching
