🤖 LLMs & AI Agents — Full Deep Dive

Large Language Models and AI Agents are reshaping the way we build intelligent, autonomous systems. This guide explains what they are under the hood, how they work, and how to architect production systems around them — from a system design perspective.

📑 Table of Contents

What is a Large Language Model (LLM)?
How LLMs Work Internally
Transformer Architecture
LLM Inference Pipeline
Prompt Engineering Fundamentals
What is an AI Agent?
AI Agent Architecture
Agent Patterns & Topologies
Memory Systems in Agents
Tool Use & Function Calling
Multi-Agent Systems
RAG vs Fine-tuning vs Agents
Production System Design for LLMs
Key Trade-offs & Failure Modes
Glossary

What is a Large Language Model (LLM)?

A Large Language Model is a deep neural network trained on massive text corpora to predict the next token (word-piece) in a sequence. Through this deceptively simple training objective, the model learns grammar, facts, reasoning, code, and even emergent capabilities like analogical reasoning.

Key Properties

Property	Description
Parameters	Billions of learned weights (e.g. Llama 3 70B has 70 billion; GPT-4's size is undisclosed, with unofficial estimates around 1.8T)
Context Window	Maximum tokens the model can "see" at once (e.g. 128K for GPT-4o)
Tokenization	Text split into sub-word tokens via BPE / SentencePiece
Temperature	Controls randomness in sampling (0 = greedy and near-deterministic, 1+ = more varied output)
Emergent Behavior	Capabilities not explicitly trained — appear at scale thresholds
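
The Temperature row can be made concrete with a small sketch. It assumes temperature simply divides the logits before the softmax, and it treats temperature 0 as greedy decoding; the logits and function names are illustrative, not taken from any particular library.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the argmax.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Draw one token index according to the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.5)  # sharper distribution
hot = softmax_with_temperature(logits, 2.0)   # flatter distribution
```

Lower temperatures concentrate probability on the top token, which is why low-temperature output looks more repeatable.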

Training Phases

Phase 1 — Pre-training
  ├── Dataset: Trillions of tokens (web, books, code, papers)
  ├── Objective: Next-token prediction (Causal LM)
  └── Cost: Millions of $ in GPU compute

Phase 2 — Supervised Fine-tuning (SFT)
  ├── Dataset: High-quality instruction-response pairs
  ├── Objective: Follow human instructions accurately
  └── Teaches: Format, tone, task following

Phase 3 — RLHF (Reinforcement Learning from Human Feedback)
  ├── Reward Model: Trained on human rankings of model responses
  ├── RL Loop: PPO algorithm maximizes reward
  └── Result: Aligned, helpful, safe outputs

How LLMs Work Internally

Tokenization

Before any computation, text is split into tokens — sub-word units that balance vocabulary size with coverage.

Input:  "System Design"
Tokens: ["System", " Design"]   ← whitespace is part of the token
IDs:    […, 7281]                ← integer IDs (tokenizer-specific) looked up in an embedding table
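
As a rough illustration of how text becomes token IDs, the sketch below uses greedy longest-match lookup against a tiny hand-written vocabulary. This is a simplification: real BPE and SentencePiece tokenizers apply learned merge rules, and the vocabulary and IDs here are hypothetical.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocab entry matches.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

# Hypothetical vocabulary; note that " Design" carries its leading space.
vocab = {"System": 0, " Design": 1, "Sys": 2, "tem": 3, " ": 4, "Design": 5}
tokens = tokenize("System Design", vocab)
ids = [vocab[t] for t in tokens]
```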

Embeddings

Each token ID is converted to a high-dimensional vector (e.g. 4096 dimensions for a 7B model). The model learns that tokens with similar meanings have similar vector directions.
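
Vector direction can be compared with cosine similarity. The sketch below uses hand-made 3-dimensional vectors (hypothetical values chosen so the arithmetic works out exactly; learned embeddings are far larger and only approximately linear) to show how analogy arithmetic is evaluated.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical axes: [royalty, male, female]
emb = {
    "king": [1.0, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "queen": [1.0, 0.0, 1.0],
}

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
nearest = max(emb, key=lambda word: cosine(emb[word], target))
```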

"king" - "man" + "woman" ≈ "queen"   ← classic word2vec result

Positional Encoding

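Transformers have no inherent sense of order, so position must be injected explicitly. The original Transformer adds fixed sinusoidal positional encodings to the embeddings. Most modern LLMs instead use Rotary Position Embeddings (RoPE), which are not added to the embeddings: they rotate the query and key vectors inside attention by an angle that depends on position.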
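
A minimal sketch of the fixed sinusoidal scheme from the original Transformer, assuming the usual base of 10000. Even dimensions use sine and odd dimensions use cosine at the same frequency; the function name is illustrative.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    enc = []
    for i in range(0, d_model, 2):
        # i is the even dimension index, so the exponent is i / d_model.
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    return enc[:d_model]  # trim in case d_model is odd

first = sinusoidal_encoding(0, 4)
later = sinusoidal_encoding(5, 8)
```

Each position gets a distinct vector, and the vector is added element-wise to the token embedding at that position.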

Transformer Architecture

The Transformer is the foundation of nearly every modern LLM.

Self-Attention (The Core Mechanism)

Self-attention lets every token attend to every other token in the context. For each token, three vectors are computed:

Vector	Purpose	Analogy
Query (Q)	What I am looking for	A search query
Key (K)	What I represent	A document label
Value (V)	What information I carry	The document content

Attention Score Formula:

Attention(Q, K, V) = softmax(QKᵀ / √dk) · V

QKᵀ: dot products measure how relevant each token is to every other token
√dk: scaling keeps dot products from growing with the key dimension dk, which would saturate the softmax and shrink its gradients
softmax: normalizes scores to a probability distribution
Multiply by V: weighted sum of values based on relevance
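
The four steps above (dot product, scaling, softmax, weighted sum of values) can be sketched in plain Python. This is a single-head version with no causal mask and no learned projections, using small hypothetical matrices as lists of rows.

```python
import math

def softmax(xs):
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Relevance of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
```

Because the query matches the first key more closely, the output leans toward the first value vector.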

Multi-Head Attention

Instead of one attention pass, the model uses H parallel heads, each learning different relationships (syntax, coreference, semantics). Outputs are concatenated and projected.

Feed-Forward Network (FFN)

After attention, each token passes through a two-layer MLP with a GELU/SiLU activation. Interpretability research suggests that much of the model's factual knowledge is stored in these layers.

FFN(x) = GELU(xW₁ + b₁)W₂ + b₂

The FFN's inner (hidden) width in modern LLMs is typically about 4x the model's hidden dimension, which makes the FFN blocks the majority of the model's parameters.
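
A quick parameter count shows where that majority comes from. The sketch assumes a standard block with four square attention projections (Q, K, V, output) and a two-matrix FFN with a 4x inner width, and it ignores biases, layer norms, embeddings, and gated FFN variants.

```python
def layer_params(d_model, ffn_multiplier=4):
    """Approximate weight counts for one Transformer block."""
    attention = 4 * d_model * d_model               # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * (ffn_multiplier * d_model)  # W_1 and W_2
    return attention, ffn

attn, ffn = layer_params(4096)
ffn_share = ffn / (attn + ffn)
```

Under these assumptions the FFN holds 8d² of the 12d² weights per block, or about two thirds.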

LLM Inference Pipeline

Two Phases of Inference

1. Prefill Phase — Process the entire prompt in one forward pass (parallel). Expensive for long contexts.

2. Decode Phase — Generate one token at a time, autoregressively. Each step re-uses cached K/V tensors.

KV Cache

The Key-Value Cache stores intermediate attention computations so they don't have to be recomputed on every decoding step. It is the primary memory bottleneck in LLM serving.

Without KV Cache: O(n²) attention recomputation per token
With KV Cache:    O(n)  just compute new token's attention

System Design Note: KV cache size grows linearly with context length × batch size. At 128K context with large batches, it can consume hundreds of GB of GPU VRAM.
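
The linear growth is easy to estimate. The sketch below assumes a hypothetical 7B-class configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values) with no grouped-query attention or cache quantization, both of which would shrink the numbers.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor of 2: one tensor for keys and one for values.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch=1)
full_context = kv_cache_bytes(32, 32, 128, seq_len=128 * 1024, batch=1)
gib = full_context / 2**30
```

With these assumptions each token costs 0.5 MiB of cache, so a single 128K-token sequence needs 64 GiB before any batching.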

Prompt Engineering Fundamentals

Prompt engineering is the practice of structuring inputs to get reliable, high-quality outputs from LLMs without changing model weights.

Core Techniques

Technique	Description	When to Use
Zero-shot	Task description only, no examples	Simple, well-defined tasks
Few-shot	2-5 examples in the prompt	Improve format / style consistency
Chain-of-Thought (CoT)	"Think step by step" reasoning	Math, logic, multi-step problems
Self-Consistency	Sample many CoT paths, majority vote	High-stakes reasoning
ReAct	Interleave Reasoning + Acting steps	Agent tool use
Tree-of-Thought (ToT)	Explore branching reasoning trees	Complex planning
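
Self-Consistency from the table above reduces to a majority vote over final answers. A minimal sketch, assuming the answers have already been extracted from several sampled reasoning paths; the sample list is hypothetical.

```python
from collections import Counter

def self_consistent_answer(sampled_answers):
    """Return the most common answer and the fraction of samples that agree."""
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)

# Hypothetical final answers from five sampled chain-of-thought runs.
samples = ["11", "11", "8", "11", "10"]
answer, agreement = self_consistent_answer(samples)
```

The agreement ratio doubles as a rough confidence signal: low agreement is a reason to escalate or sample more paths.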

Chain-of-Thought Example

Without CoT:
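  Prompt:  "Roger has 5 balls. He buys 2 more cans of 3 balls each. How many balls does he have now?"
  Output:  "11"  ← bare answer; direct answers are more often wrong on multi-step problems and cannot be audited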

With CoT:
  Prompt:  "Think step by step. Roger has 5 balls..."
  Output:  "Roger starts with 5 balls. He buys 2 cans × 3 balls = 6 balls.
            5 + 6 = 11 balls."  ← correct + auditable

What is an AI Agent?

An AI Agent is a system that uses an LLM as its reasoning core to perceive its environment, plan actions, execute tools, and iterate until a goal is achieved — without step-by-step human guidance.

The key difference from a plain LLM call:

Plain LLM	AI Agent
One-shot prompt → response	Multi-step reasoning loop
No external tools	Uses tools (APIs, databases, browsers)
No memory between calls	Maintains working memory
Stateless	Stateful across steps
Fixed output	Goal-directed behavior

The OODA Loop for AI Agents

Observe → Orient → Decide → Act → (loop back to Observe)

AI Agent Architecture

Core Components

Component	Role	Examples
LLM Core	Reasoning, planning, language generation	GPT-4, Claude, Gemini, Llama
Planner	Decompose goal into sub-tasks	ReAct, Plan-and-Solve, AutoGPT
Tool Registry	Catalogue of available tools + schemas	OpenAI Function Calling, LangChain Tools
Memory	Store intermediate results, past context	Vector DB, Redis, in-context
Executor	Call tools and return observations	Python sandbox, API client
Evaluator	Determine if goal is reached	LLM-as-judge, rule-based check

Agent Patterns & Topologies

1. ReAct Pattern (Reason + Act)

The most widely used single-agent pattern. The model interleaves thoughts with tool calls.

Thought: I need to find the capital of France.
Action: web_search("capital of France")
Observation: "Paris is the capital of France."
Thought: I now know the answer.
Final Answer: The capital of France is Paris.
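
The trace above can be produced by a short control loop. In this sketch the model is replaced by a scripted stand-in (`scripted_llm`) and the search tool returns a canned string, so every name here is hypothetical; a real agent would call an LLM API and parse its output instead.

```python
def web_search(query):
    # Stand-in tool: a real agent would call a search API here.
    return "Paris is the capital of France."

TOOLS = {"web_search": web_search}

def scripted_llm(transcript):
    """Hypothetical stand-in for a model call, driven by the transcript so far."""
    if "Observation:" not in transcript:
        return {"thought": "I need to find the capital of France.",
                "action": "web_search", "input": "capital of France"}
    return {"thought": "I now know the answer.",
            "final": "The capital of France is Paris."}

def react(question, llm, tools, max_steps=5):
    """Alternate thoughts and tool calls until the model gives a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += f"Thought: {step['thought']}\n"
        if "final" in step:
            return step["final"], transcript
        observation = tools[step["action"]](step["input"])
        transcript += f"Action: {step['action']}({step['input']!r})\n"
        transcript += f"Observation: {observation}\n"
    raise RuntimeError("step budget exhausted without a final answer")

answer, trace = react("What is the capital of France?", scripted_llm, TOOLS)
```

The `max_steps` cap is what keeps a confused agent from looping indefinitely.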

2. Plan-and-Execute Pattern

Separate planning and execution phases. Better for long-horizon tasks.
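
A minimal sketch of that separation: the plan is produced once, up front, and a plain executor then walks it. The planner and both tools are hypothetical stubs; a real system would ask an LLM for the plan and might re-plan when a step fails.

```python
def make_plan(goal):
    """Hypothetical planner; a real system would ask an LLM for this list."""
    return [
        {"tool": "fetch", "arg": goal},
        {"tool": "summarize", "arg": None},  # None means "use the previous result"
    ]

TOOLS = {
    "fetch": lambda topic: f"raw notes about {topic}",
    "summarize": lambda text: text.upper(),  # placeholder for a real summarizer
}

def execute(plan, tools):
    """Run each planned step in order, feeding results forward when asked."""
    results = []
    for step in plan:
        arg = step["arg"] if step["arg"] is not None else results[-1]
        results.append(tools[step["tool"]](arg))
    return results

results = execute(make_plan("kv cache"), TOOLS)
```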

3. Reflexion Pattern

Agent reflects on its failures and tries again with improved strategies.

Attempt 1 → Failure
Reflection: "I searched too broadly. I should narrow the query."
Attempt 2 → Partial Success
Reflection: "Missing the pricing data. I should query a different source."
Attempt 3 → Success
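
The attempt and reflect cycle above can be sketched as a retry loop that carries its reflections forward. The `attempt` and `reflect` functions are hypothetical stubs standing in for LLM calls; the stub task is rigged to succeed on the third try.

```python
def reflexion(attempt, reflect, max_attempts=3):
    """Retry a task, passing accumulated reflections into each new attempt."""
    reflections = []
    for _ in range(max_attempts):
        ok, output = attempt(reflections)
        if ok:
            return output, reflections
        reflections.append(reflect(output))
    return None, reflections

def attempt(reflections):
    # Hypothetical task: succeeds only once two lessons have been recorded.
    if len(reflections) >= 2:
        return True, "report with pricing data"
    return False, f"incomplete result #{len(reflections) + 1}"

def reflect(failed_output):
    # A real agent would ask the LLM to critique the failed output.
    return f"Lesson from {failed_output!r}: adjust the query."

output, notes = reflexion(attempt, reflect)
```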

4. Multi-Agent Supervisor Pattern

A supervisor agent delegates sub-tasks to specialized worker agents.
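
A minimal sketch of the delegation step, assuming two hypothetical workers and a keyword-based routing rule. In practice the supervisor would usually be an LLM choosing a worker from their descriptions.

```python
WORKERS = {
    "research": lambda task: f"[research] findings for: {task}",
    "code": lambda task: f"[code] patch for: {task}",
}

def route(task):
    """Hypothetical routing rule; a real supervisor would ask an LLM to choose."""
    return "code" if "bug" in task.lower() else "research"

def supervise(tasks):
    """Send each sub-task to the chosen worker and collect the outputs."""
    return [WORKERS[route(task)](task) for task in tasks]

outputs = supervise(["Fix the login bug", "Compare vector databases"])
```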

Memory Systems in Agents

Agents need memory to maintain context, learn from past steps, and avoid repeating mistakes.

Four Types of Agent Memory

┌────────────────────────────────────────────────────────┐
│  In-Context (Working Memory)                           │
│  • Everything currently in the LLM's context window   │
│  • Fast, immediate, limited by token window            │
│  • Lost when session ends                             │
├────────────────────────────────────────────────────────┤
│  External Short-Term (Episodic)                        │
│  • Recent observations, tool outputs, chat history    │
│  • Stored in Redis or a DB, retrieved per session     │
│  • Survives across API calls                          │
├────────────────────────────────────────────────────────┤
│  Semantic Memory (Vectorized Knowledge)                │
│  • Encoded facts, documents, embeddings               │
│  • Stored in Vector DB (Pinecone, Weaviate, FAISS)    │
│  • Retrieved via similarity search (RAG)              │
├────────────────────────────────────────────────────────┤
│  Procedural Memory (Skills / Fine-tuned Weights)      │
│  • How to perform tasks — baked into model weights    │
│  • Updated via fine-tuning / LoRA                     │
│  • Permanent, expensive to update                     │
└────────────────────────────────────────────────────────┘

Memory Retrieval Flow (RAG-style)
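
One way to picture the flow is: embed the query, rank stored memories by similarity, and place the best match in the prompt. The sketch below uses word counts as a toy embedding (an assumption made to keep it dependency-free; a real system would call an embedding model and a vector database).

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)  # Counter returns 0 for missing words
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, memory, k=1):
    """Return the k stored entries most similar to the query."""
    q = embed(query)
    ranked = sorted(memory, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

memory = [
    "the kv cache stores attention keys and values",
    "temperature controls sampling randomness",
    "agents call tools through function calling",
]
question = "what does temperature control"
context = retrieve(question, memory)
prompt = f"Context: {context[0]}\nQuestion: {question}"
```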

Tool Use & Function Calling

Tools are the "hands" of an AI Agent — they let the LLM interact with the real world.

How Function Calling Works (OpenAI-style)

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": { "type": "string" },
            "units": { "type": "string", "enum": ["celsius", "fahrenheit"] }
          },
          "required": ["city"]
        }
      }
    }
  ]
}

The LLM outputs a structured JSON tool call; the host application executes it and feeds the result back as an observation.
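
On the host side, that step amounts to parsing the call, validating it, and dispatching to a real function. The sketch assumes a simplified call shape (`{"name": ..., "arguments": {...}}`) and a hand-rolled check rather than a full JSON Schema validator; actual provider responses wrap the call in additional fields, so treat the shapes here as hypothetical.

```python
import json

def get_weather(city, units="celsius"):
    # Stub tool: returns canned data instead of calling a weather service.
    return {"city": city, "units": units, "temp": 21}

TOOLS = {"get_weather": get_weather}

# Simplified stand-in for the JSON Schema in the tool definition.
SCHEMA = {"required": ["city"], "enums": {"units": ["celsius", "fahrenheit"]}}

def dispatch(tool_call_json):
    """Validate a model-produced tool call, then run the matching function."""
    call = json.loads(tool_call_json)
    args = call["arguments"]
    for name in SCHEMA["required"]:
        if name not in args:
            raise ValueError(f"missing required argument: {name}")
    for name, allowed in SCHEMA["enums"].items():
        if name in args and args[name] not in allowed:
            raise ValueError(f"invalid value for {name}: {args[name]!r}")
    return TOOLS[call["name"]](**args)

observation = dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}')
```

Rejecting malformed calls before execution is cheaper than letting a tool fail midway, and the error text can be fed back to the model as an observation.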

Common Tool Categories

Category	Examples	Use Case
Search	Tavily, Bing API, SerpAPI	Real-time web information
Code Execution	Python REPL, E2B sandbox	Computation, data analysis
Database	SQL query, vector search	Structured data retrieval
File I/O	Read/write files, parse PDFs	Document processing
Browser	Playwright, Puppeteer	Web scraping, UI automation
APIs	REST calls, GraphQL	Third-party services
Communication	Send email, Slack, SMS	Notifications, alerts

Multi-Agent Systems

Complex goals are often easier to handle with multiple specialized agents collaborating than with one general-purpose agent.

Design Principles for Multi-Agent Systems

Single Responsibility — Each agent has one clear job
Shared Memory — Agents communicate via a shared state store, not direct calls
Idempotent Tools — Tools should be safe to retry on failure
Human-in-the-Loop — Critical decisions require human confirmation
Timeouts & Budgets — Set max steps, time limits, and token budgets per agent
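
The last principle is straightforward to enforce in code. A minimal sketch of a per-agent budget guard follows; the class names and the per-step token figures are hypothetical.

```python
class BudgetExceeded(Exception):
    pass

class Budget:
    """Tracks steps and tokens for one agent and stops it at either limit."""

    def __init__(self, max_steps, max_tokens):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.steps = 0
        self.tokens = 0

    def charge(self, tokens):
        """Record one agent step; raise once either limit is crossed."""
        self.steps += 1
        self.tokens += tokens
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token limit {self.max_tokens} exceeded")

budget = Budget(max_steps=3, max_tokens=1000)
completed = 0
reason = None
try:
    for used in [300, 300, 300, 300]:  # hypothetical token usage per step
        budget.charge(used)
        completed += 1
except BudgetExceeded as stop:
    reason = str(stop)
```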

RAG vs Fine-tuning vs Agents

Three ways to augment LLM capability — each with different trade-offs.

Approach	What It Is	Pros	Cons
RAG	Retrieve relevant docs at query time	Always fresh data, cheap, auditable	Retrieval quality is a bottleneck
Fine-tuning	Re-train model on domain data	Deep knowledge, fast inference	Expensive, data goes stale
Agents	LLM calls tools dynamically	Flexible, real-time, general	Latency, cost, complex to debug
RAG + Agents	Agents retrieve + reason	Best of both worlds	Highest complexity

Decision Matrix

Is the knowledge static and domain-specific?
  └─ YES → Fine-tuning OR RAG
       ├── Small dataset, need low latency → Fine-tuning
       └── Large/changing documents → RAG

Does the task require real-time data or external actions?
  └─ YES → Agents with tool use

Is the task multi-step with ambiguous sub-goals?
  └─ YES → Agents (ReAct or Plan-and-Execute)

Need low cost + simple Q&A over documents?
  └─ YES → RAG alone (no agents)

Production System Design for LLMs

High-Level Architecture

Key Infrastructure Components

Component	Purpose	Tools
LLM Proxy	Route to best model, handle fallback	LiteLLM, OpenRouter
Semantic Cache	Cache similar queries by embedding similarity	GPTCache, Redis
Rate Limiting	Per-user token/request limits	Redis token bucket
Guardrails	Detect harmful/off-topic outputs	Guardrails AI, NeMo Guardrails
Observability	Track latency, cost, token usage	LangSmith, Langfuse
Vector DB	Store embeddings for RAG / agent memory	Pinecone, Weaviate
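
The Rate Limiting row names a token bucket. Here is an in-memory sketch of the algorithm, with the clock passed in explicitly so behaviour is easy to follow; a production version would keep the bucket state in Redis, and the capacity and refill numbers are hypothetical.

```python
class TokenBucket:
    """Per-user bucket: requests spend LLM tokens, which refill over time."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.level = float(capacity)  # start full
        self.last = now

    def allow(self, cost, now):
        """Refill for the elapsed time, then spend `cost` if enough is left."""
        elapsed = now - self.last
        self.level = min(self.capacity, self.level + elapsed * self.refill_per_second)
        self.last = now
        if cost <= self.level:
            self.level -= cost
            return True
        return False

bucket = TokenBucket(capacity=100, refill_per_second=10)
first = bucket.allow(80, now=0.0)   # 100 available, 20 left afterwards
second = bucket.allow(50, now=0.0)  # only 20 available, rejected
third = bucket.allow(50, now=3.0)   # refilled to 50 after 3 seconds
```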

Latency Budget for an Agent Call

User request → API Gateway:        ~5ms
Gateway → LLM Proxy:               ~5ms
Prompt construction (RAG):         ~50-200ms  (embed + vector search)
LLM First Token (TTFT):            ~500ms-2s
LLM Generation (streaming):        ~2-10s total
Tool execution (if any):           ~100ms-3s per tool
Total per agent step:              ~3-15 seconds
Multi-step agent (5 steps):        ~15-75 seconds
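
The totals can be sanity-checked by summing the itemized ranges. The sketch assumes the generation figure already includes time to first token and that every step makes one tool call; both are assumptions, not statements from the budget above.

```python
# (low, high) in milliseconds for each stage of one agent step
STEP_MS = {
    "gateway": (5, 5),
    "proxy": (5, 5),
    "prompt_construction": (50, 200),
    "generation": (2000, 10000),  # assumed to include time to first token
    "tool": (100, 3000),          # assumes exactly one tool call per step
}

low_ms = sum(low for low, _ in STEP_MS.values())
high_ms = sum(high for _, high in STEP_MS.values())
five_step_seconds = (5 * low_ms / 1000, 5 * high_ms / 1000)
```

Summed this way a single step takes roughly 2 to 13 seconds, which the budget rounds up, and generation plus tool execution dominate everything else.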

Design Insight: Stream responses to the user immediately after first token. Never make users wait for the full generation.

Scaling Considerations

Token throughput bottleneck → GPU memory capacity and bandwidth are the constraint, not CPU
KV Cache size              → Grows with batch size × context length
Cost                       → ~$10/1M output tokens (GPT-4o list price, 2025; verify against current pricing)
Latency vs throughput      → Continuous batching maximizes GPU utilization
Cold start                 → Large models take 10-30s to load; keep warm replicas

Key Trade-offs & Failure Modes

Trade-off Table

Decision	You Gain	You Lose
Larger model	Better reasoning	Higher cost, latency
Longer context	More information	Quadratic attention cost
More agent steps	Deeper reasoning	More latency, cost, chance of error
Fine-tuning	Domain expertise	Expensive, staleness risk
Streaming	Perceived speed	Complex client handling
Self-hosted model	Cost control, privacy	Infra overhead

Common Failure Modes

Failure	Cause	Mitigation
Hallucination	Model generates confident falsehoods	RAG grounding, citations, self-check
Prompt Injection	Malicious input hijacks agent behavior	Input sanitization, sandboxing
Infinite Loop	Agent can't reach goal, loops forever	Max steps limit, loop detection
Tool Misuse	Wrong arguments passed to tool	Strict JSON schema validation
Context Overflow	Too many tool results fill context window	Summarize + truncate observations
Cascading Cost	Runaway agent burns token budget	Hard cost caps per session
Stale Knowledge	Training cutoff misses recent events	RAG with live data sources

Glossary

Term	Definition
Token	Sub-word unit of text; ~0.75 English words on average
Context Window	Max tokens the model can process at once
Temperature	Sampling randomness (0 = deterministic)
Top-P (nucleus)	Sample from the smallest set of top tokens whose cumulative probability reaches P
RLHF	Reinforcement Learning from Human Feedback
LoRA	Low-Rank Adaptation — efficient fine-tuning method
KV Cache	Cached Key-Value tensors to speed up decoding
Embedding	High-dimensional vector representation of text
RAG	Retrieval-Augmented Generation
Chain-of-Thought	Prompting model to reason step-by-step
Function Calling	Structured LLM output invoking a defined tool schema
Agent Loop	Observe → Plan → Act → Observe cycle
Guardrails	Safety checks on model inputs/outputs
TTFT	Time to First Token — key latency metric
Hallucination	Model generating confident but factually wrong content

Related Topics: Vector Databases · Vector Embeddings · RAG Pattern · ANN & HNSW Indexing
