LLMs & AI Agents: Full Deep Dive
Large Language Models and AI Agents are reshaping the way we build intelligent, autonomous systems. This guide explains what they are under the hood, how they work, and how to architect production systems around them, from a system design perspective.
Table of Contents
- What is a Large Language Model (LLM)?
- How LLMs Work Internally
- Transformer Architecture
- LLM Inference Pipeline
- Prompt Engineering Fundamentals
- What is an AI Agent?
- AI Agent Architecture
- Agent Patterns & Topologies
- Memory Systems in Agents
- Tool Use & Function Calling
- Multi-Agent Systems
- RAG vs Fine-tuning vs Agents
- Production System Design for LLMs
- Key Trade-offs & Failure Modes
- Glossary
What is a Large Language Model (LLM)?
A Large Language Model is a deep neural network trained on massive text corpora to predict the next token (word-piece) in a sequence. Through this deceptively simple training objective, the model learns grammar, facts, reasoning, code, and even emergent capabilities like analogical reasoning.
Key Properties
| Property | Description |
|---|---|
| Parameters | Billions of learned weights (Llama 3 70B = 70 billion; GPT-4 is unofficially estimated at ~1.8T, a figure OpenAI has not confirmed) |
| Context Window | Maximum tokens the model can "see" at once (e.g. 128K for GPT-4o) |
| Tokenization | Text split into sub-word tokens via BPE / SentencePiece |
| Temperature | Controls randomness in sampling (0 = deterministic, 1+ = creative) |
| Emergent Behavior | Capabilities not explicitly trained for that appear at scale thresholds |
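To make the Temperature row concrete, here is a minimal sketch of temperature-scaled sampling. The logits are made-up scores for a three-token vocabulary, and treating temperature 0 as greedy decoding is a common convention assumed here, not something every API guarantees.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Assumed convention: temperature 0 means greedy decoding (all mass on the top logit).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng=random):
    """Draw one token index according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

As the temperature rises, probability mass moves from the top token toward the others, which is why higher values read as more "creative".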
Training Phases
Phase 1: Pre-training
├── Dataset: Trillions of tokens (web, books, code, papers)
├── Objective: Next-token prediction (Causal LM)
└── Cost: Millions of $ in GPU compute
Phase 2: Supervised Fine-tuning (SFT)
├── Dataset: High-quality instruction-response pairs
├── Objective: Follow human instructions accurately
└── Teaches: Format, tone, task following
Phase 3: RLHF (Reinforcement Learning from Human Feedback)
├── Reward Model: Humans rank model responses
├── RL Loop: PPO algorithm maximizes reward
└── Result: Aligned, helpful, safe outputs
How LLMs Work Internally
Tokenization
Before any computation, text is split into tokens: sub-word units that balance vocabulary size with coverage.
Input:  "System Design"
Tokens: ["System", " Design"]   ← whitespace is part of the token
IDs:    […, 7281]               ← integer IDs looked up in an embedding table
Embeddings
Each token ID is converted to a high-dimensional vector (e.g. 4096 dimensions for a 7B model). The model learns that tokens with similar meanings have similar vector directions.
"king" - "man" + "woman" ≈ "queen"   ← classic word2vec result
Positional Encoding
Transformers have no inherent sense of order, so position must be injected explicitly. The original Transformer adds fixed sinusoidal encodings to the embeddings; most modern LLMs instead use Rotary Position Embeddings (RoPE), which rotate the query and key vectors by a position-dependent angle rather than learning or adding a position vector.
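As an illustration of the fixed sinusoidal variant, the sketch below builds the encoding table and adds it to a batch of toy embeddings. The function names and the 4-dimensional embeddings are invented for the example, and it assumes an even model dimension.

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal encodings in the style of the original Transformer (d_model assumed even)."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the encoding for position p to the embedding at position p."""
    pe = sinusoidal_positions(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(erow, prow)] for erow, prow in zip(embeddings, pe)]

# The same toy token embedding repeated three times: without positions the rows are identical.
same_token = [[0.5, 0.5, 0.5, 0.5] for _ in range(3)]
for row in add_positions(same_token):
    print([round(v, 3) for v in row])
```

After the addition, the three rows differ even though the token is the same, which is what lets attention distinguish positions.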
Transformer Architecture
The Transformer is the foundation of nearly every modern LLM.
Self-Attention (The Core Mechanism)
Self-attention lets every token attend to every other token in the context (in decoder-only LLMs, a causal mask restricts each token to itself and earlier tokens). For each token, three vectors are computed:
| Vector | Purpose | Analogy |
|---|---|---|
| Query (Q) | What I am looking for | A search query |
| Key (K) | What I represent | A document label |
| Value (V) | What information I carry | The document content |
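Attention Score Formula:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
- $QK^{\top}$: dot product measures how relevant tokens are to each other
- $\sqrt{d_k}$: scaling keeps dot products from growing with the key dimension, which would saturate the softmax and shrink its gradients
- softmax: normalizes scores to a probability distribution
- Multiply by $V$: weighted sum of values based on relevance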
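The attention computation described above can be sketched in plain Python for a single head. This is an illustrative toy, not an efficient implementation; the `causal` flag and the tiny 2-dimensional inputs are assumptions made for the example.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention for one head. Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Relevance of token i to every token j, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Mask out future positions so token i only sees tokens 0..i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Weighted sum of value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
print(attention(Q, K, V, causal=True))
```

With the causal mask, the first output row equals the first value vector, because the first token can only attend to itself.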
Multi-Head Attention
Instead of one attention pass, the model uses H parallel heads, each learning different relationships (syntax, coreference, semantics). Outputs are concatenated and projected.
Feed-Forward Network (FFN)
After attention, each token passes through a two-layer MLP with a GELU/SiLU activation. Much of the model's factual knowledge is thought to be stored here.
FFN(x) = GELU(xW₁ + b₁)W₂ + b₂
The FFN's inner dimension in modern LLMs is typically ~4x the model's hidden dimension, so the FFN layers hold the majority of parameters.
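A quick back-of-the-envelope check of the "majority of parameters" claim, under simplifying assumptions: one standard Transformer block with four square attention projections and a two-matrix FFN at 4x width, ignoring biases, layer norms, and embeddings. Gated variants such as SwiGLU use three matrices with a smaller multiplier, so the exact share differs by model.

```python
def block_params(d_model, ffn_mult=4):
    """Rough weight counts for one Transformer block (biases and norms ignored)."""
    attn = 4 * d_model * d_model              # W_q, W_k, W_v, W_o, each d_model x d_model
    ffn = 2 * d_model * (ffn_mult * d_model)  # W_1 (d x 4d) and W_2 (4d x d)
    return attn, ffn

attn, ffn = block_params(4096)
print(f"attention: {attn:,}  ffn: {ffn:,}  ffn share: {ffn / (attn + ffn):.0%}")
# attention: 67,108,864  ffn: 134,217,728  ffn share: 67%
```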
LLM Inference Pipeline
Two Phases of Inference
1. Prefill Phase: Process the entire prompt in one parallel forward pass. Expensive for long contexts.
2. Decode Phase: Generate one token at a time, autoregressively. Each step reuses cached K/V tensors.
KV Cache
The Key-Value Cache stores the key and value tensors of already-processed tokens so they don't have to be recomputed on every decoding step. It is the primary memory bottleneck in LLM serving.
Without KV Cache: O(n²) attention recomputation per token
With KV Cache:    O(n) per token, only the new token's attention is computed
System Design Note: KV Cache grows linearly with context length × batch size. At 128K context with large batches, it can consume 100s of GB of GPU VRAM.
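To see where the "100s of GB" figure comes from, here is a rough sizing sketch. The model shape (32 layers, 32 KV heads, head dimension 128, fp16 values) is an assumed 7B-class configuration without grouped-query attention; real models, and especially those using GQA, can need far less.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer, head, and token."""
    # The leading 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

GiB = 1024 ** 3
per_sequence = kv_cache_bytes(32, 32, 128, seq_len=128 * 1024, batch_size=1)
batch_of_8 = kv_cache_bytes(32, 32, 128, seq_len=128 * 1024, batch_size=8)
print(per_sequence / GiB)  # 64.0
print(batch_of_8 / GiB)    # 512.0
```

Because every factor multiplies, halving the number of KV heads or quantizing the cache to 8 bits halves the footprint.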
Prompt Engineering Fundamentals
Prompt engineering is the practice of structuring inputs to get reliable, high-quality outputs from LLMs without changing model weights.
Core Techniques
| Technique | Description | When to Use |
|---|---|---|
| Zero-shot | Task description only, no examples | Simple, well-defined tasks |
| Few-shot | 2-5 examples in the prompt | Improve format / style consistency |
| Chain-of-Thought (CoT) | "Think step by step" reasoning | Math, logic, multi-step problems |
| Self-Consistency | Sample many CoT paths, majority vote | High-stakes reasoning |
| ReAct | Interleave Reasoning + Acting steps | Agent tool use |
| Tree-of-Thought (ToT) | Explore branching reasoning trees | Complex planning |
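Self-Consistency from the table reduces to "sample several answers, keep the most common one". The sketch below assumes a hypothetical `sample_answer` callable that stands in for one chain-of-thought LLM call at non-zero temperature and returns only the final answer.

```python
from collections import Counter

def self_consistency(sample_answer, n_samples=5):
    """Sample several reasoning paths and return the majority answer and its vote share."""
    answers = [sample_answer() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

# Scripted stand-in for five sampled LLM completions (hypothetical outputs).
scripted = iter(["11", "8", "11", "11", "7"])
answer, agreement = self_consistency(lambda: next(scripted), n_samples=5)
print(answer, agreement)  # 11 0.6
```

The vote share doubles as a rough confidence signal: low agreement suggests the question deserves a fallback such as human review.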
Chain-of-Thought Example
Without CoT:
Prompt: "Roger has 5 balls. He buys 2 more cans of 3 balls. How many balls?"
Output: a bare number with no visible reasoning   ← often wrong, and hard to audit
With CoT:
Prompt: "Think step by step. Roger has 5 balls..."
Output: "Roger starts with 5 balls. He buys 2 cans × 3 balls = 6 balls.
5 + 6 = 11 balls."   ← correct + auditable
What is an AI Agent?
An AI Agent is a system that uses an LLM as its reasoning core to perceive its environment, plan actions, execute tools, and iterate until a goal is achieved, without step-by-step human guidance.
The key difference from a plain LLM call:
| Plain LLM | AI Agent |
|---|---|
| One-shot prompt → response | Multi-step reasoning loop |
| No external tools | Uses tools (APIs, databases, browsers) |
| No memory between calls | Maintains working memory |
| Stateless | Stateful across steps |
| Fixed output | Goal-directed behavior |
The OODA Loop for AI Agents
Observe → Orient → Decide → Act → (loop back to Observe)
AI Agent Architecture
Core Components
| Component | Role | Examples |
|---|---|---|
| LLM Core | Reasoning, planning, language generation | GPT-4, Claude, Gemini, Llama |
| Planner | Decompose goal into sub-tasks | ReAct, Plan-and-Solve, AutoGPT |
| Tool Registry | Catalogue of available tools + schemas | OpenAI Function Calling, LangChain Tools |
| Memory | Store intermediate results, past context | Vector DB, Redis, in-context |
| Executor | Call tools and return observations | Python sandbox, API client |
| Evaluator | Determine if goal is reached | LLM-as-judge, rule-based check |
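One way the components in the table might be wired together is a simple loop. In this sketch `scripted_llm` is a hypothetical stand-in for a real model call, the decision dictionary format is invented for the example, and `web_search` is a stub rather than a real search tool.

```python
def run_agent(goal, llm, tools, max_steps=5):
    """Minimal agent loop: the LLM picks an action, the executor runs it, the result is fed back."""
    memory = [f"Goal: {goal}"]                   # in-context working memory
    for _ in range(max_steps):
        decision = llm(memory)                   # LLM core / planner
        if decision["type"] == "final":          # evaluator: the model declares the goal reached
            return decision["answer"]
        tool = tools[decision["tool"]]           # tool registry lookup
        observation = tool(**decision["args"])   # executor
        memory.append(f"Observation: {observation}")
    return None                                  # step budget exhausted

def scripted_llm(memory):
    # Hypothetical model: search once, then answer from the observation.
    if not any(m.startswith("Observation") for m in memory):
        return {"type": "tool", "tool": "web_search", "args": {"query": "capital of France"}}
    return {"type": "final", "answer": "The capital of France is Paris."}

tools = {"web_search": lambda query: "Paris is the capital of France."}
print(run_agent("Find the capital of France", scripted_llm, tools))
```

The `max_steps` cap is the simplest guard against an agent that never reaches its goal.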
Agent Patterns & Topologies
1. ReAct Pattern (Reason + Act)
The most widely used single-agent pattern. The model interleaves thoughts with tool calls.
Thought: I need to find the capital of France.
Action: web_search("capital of France")
Observation: "Paris is the capital of France."
Thought: I now know the answer.
Final Answer: The capital of France is Paris.
2. Plan-and-Execute Pattern
Separate planning and execution phases. Better for long-horizon tasks.
3. Reflexion Pattern
The agent reflects on its failures and tries again with an improved strategy.
Attempt 1 → Failure
Reflection: "I searched too broadly. I should narrow the query."
Attempt 2 → Partial Success
Reflection: "Missing the pricing data. I should query a different source."
Attempt 3 → Success
4. Multi-Agent Supervisor Pattern
A supervisor agent delegates sub-tasks to specialized worker agents.
Memory Systems in Agents
Agents need memory to maintain context, learn from past steps, and avoid repeating mistakes.
Four Types of Agent Memory
- In-Context (Working Memory)
  - Everything currently in the LLM's context window
  - Fast, immediate, limited by token window
  - Lost when session ends
- External Short-Term (Episodic)
  - Recent observations, tool outputs, chat history
  - Stored in Redis or a DB, retrieved per session
  - Survives across API calls
- Semantic Memory (Vectorized Knowledge)
  - Encoded facts, documents, embeddings
  - Stored in Vector DB (Pinecone, Weaviate, FAISS)
  - Retrieved via similarity search (RAG)
- Procedural Memory (Skills / Fine-tuned Weights)
  - How to perform tasks, baked into model weights
  - Updated via fine-tuning / LoRA
  - Permanent, expensive to update
Memory Retrieval Flow (RAG-style)
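A retrieval flow of this kind typically embeds the query, ranks stored memories by similarity, and places the top matches into the prompt. The sketch below is a toy version under loud assumptions: `embed` is a bag-of-words counter over a tiny hand-picked vocabulary, standing in for a learned embedding model, and a Python list stands in for the vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def embed(text, vocab):
    # Toy embedding: word counts over a fixed vocabulary (a real system calls an embedding model).
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def retrieve(query, memories, vocab, k=2):
    """Return the k stored memories most similar to the query."""
    q = embed(query, vocab)
    ranked = sorted(memories, key=lambda m: cosine(q, embed(m, vocab)), reverse=True)
    return ranked[:k]

def build_prompt(query, memories, vocab, k=2):
    """Place retrieved memories ahead of the question so the LLM can ground its answer."""
    context = "\n".join(f"- {m}" for m in retrieve(query, memories, vocab, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

vocab = ["refund", "policy", "shipping", "days", "login"]
memories = [
    "refund policy allows returns within 30 days",
    "shipping takes 5 days",
    "login requires two-factor codes",
]
print(build_prompt("what is the refund policy", memories, vocab, k=1))
```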
Tool Use & Function Calling
Tools are the "hands" of an AI Agent: they let the LLM interact with the real world.
How Function Calling Works (OpenAI-style)
{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": { "type": "string" },
            "units": { "type": "string", "enum": ["celsius", "fahrenheit"] }
          },
          "required": ["city"]
        }
      }
    }
  ]
}
The LLM outputs a structured JSON tool call; the host application executes it and feeds the result back as an observation.
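On the host side, handling such a tool call might look like the sketch below. It assumes the model's arguments arrive as a JSON string, as in OpenAI-style APIs; `get_weather` is a stub returning a made-up reading, and the hand-rolled `SCHEMA` check is a simplified stand-in for a proper JSON Schema validator.

```python
import json

def get_weather(city, units="celsius"):
    # Hypothetical stub; a real tool would call a weather service here.
    return {"city": city, "temp": 21, "units": units}

TOOLS = {"get_weather": get_weather}
SCHEMA = {
    "get_weather": {
        "properties": ["city", "units"],
        "required": ["city"],
        "enum": {"units": ["celsius", "fahrenheit"]},
    }
}

def execute_tool_call(name, arguments_json):
    """Validate a model-produced tool call against the schema, then run the tool."""
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return {"error": f"invalid JSON arguments: {exc}"}
    if not isinstance(args, dict):
        return {"error": "arguments must be a JSON object"}
    spec = SCHEMA[name]
    unknown = [k for k in args if k not in spec["properties"]]
    missing = [k for k in spec["required"] if k not in args]
    if unknown or missing:
        return {"error": f"unknown={unknown} missing={missing}"}
    for key, allowed in spec["enum"].items():
        if key in args and args[key] not in allowed:
            return {"error": f"{key} must be one of {allowed}"}
    return TOOLS[name](**args)

print(execute_tool_call("get_weather", '{"city": "Paris"}'))
print(execute_tool_call("get_weather", '{"city": "Paris", "units": "kelvin"}'))
```

Returning errors as observations, rather than raising, lets the model see what went wrong and retry with corrected arguments.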
Common Tool Categories
| Category | Examples | Use Case |
|---|---|---|
| Search | Tavily, Bing API, SerpAPI | Real-time web information |
| Code Execution | Python REPL, E2B sandbox | Computation, data analysis |
| Database | SQL query, vector search | Structured data retrieval |
| File I/O | Read/write files, parse PDFs | Document processing |
| Browser | Playwright, Puppeteer | Web scraping, UI automation |
| APIs | REST calls, GraphQL | Third-party services |
| Communication | Send email, Slack, SMS | Notifications, alerts |
Multi-Agent Systems
Complex goals often call for multiple specialized agents collaborating.
Design Principles for Multi-Agent Systems
- Single Responsibility: Each agent has one clear job
- Shared Memory: Agents communicate via a shared state store, not direct calls
- Idempotent Tools: Tools should be safe to retry on failure
- Human-in-the-Loop: Critical decisions require human confirmation
- Timeouts & Budgets: Set max steps, time limits, and token budgets per agent
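The last principle, per-agent budgets, can be sketched as a small tracker that the agent loop consults before each step. The `Budget` class and its limits are illustrative names and numbers, not part of any particular framework.

```python
import time

class Budget:
    """Tracks steps, tokens, and wall-clock time for one agent run."""

    def __init__(self, max_steps, max_tokens, max_seconds, clock=time.monotonic):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.clock = clock
        self.started = clock()
        self.steps = 0
        self.tokens = 0

    def charge(self, tokens_used):
        """Record one completed step and the tokens it consumed."""
        self.steps += 1
        self.tokens += tokens_used

    def exceeded(self):
        """Return the name of the first exhausted limit, or None if the run may continue."""
        if self.steps >= self.max_steps:
            return "steps"
        if self.tokens >= self.max_tokens:
            return "tokens"
        if self.clock() - self.started >= self.max_seconds:
            return "time"
        return None

budget = Budget(max_steps=3, max_tokens=1000, max_seconds=30)
while budget.exceeded() is None:
    budget.charge(tokens_used=400)  # stand-in for one LLM call plus tool use
print("stopped because of:", budget.exceeded())
```

Injecting the clock keeps the time limit testable without sleeping.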
RAG vs Fine-tuning vs Agents
Three ways to augment LLM capability, each with different trade-offs.
| Approach | What It Is | Pros | Cons |
|---|---|---|---|
| RAG | Retrieve relevant docs at query time | Always fresh data, cheap, auditable | Retrieval quality is a bottleneck |
| Fine-tuning | Re-train model on domain data | Deep knowledge, fast inference | Expensive, data goes stale |
| Agents | LLM calls tools dynamically | Flexible, real-time, general | Latency, cost, complex to debug |
| RAG + Agents | Agents retrieve + reason | Best of both worlds | Highest complexity |
Decision Matrix
Is the knowledge static and domain-specific?
└─ YES → Fine-tuning OR RAG
   ├── Small dataset, need low latency → Fine-tuning
   └── Large/changing documents → RAG
Does the task require real-time data or external actions?
└─ YES → Agents with tool use
Is the task multi-step with ambiguous sub-goals?
└─ YES → Agents (ReAct or Plan-and-Execute)
Need low cost + simple Q&A over documents?
└─ YES → RAG alone (no agents)
Production System Design for LLMs
High-Level Architecture
Key Infrastructure Components
| Component | Purpose | Tools |
|---|---|---|
| LLM Proxy | Route to best model, handle fallback | LiteLLM, OpenRouter |
| Semantic Cache | Cache similar queries by embedding similarity | GPTCache, Redis |
| Rate Limiting | Per-user token/request limits | Redis token bucket |
| Guardrails | Detect harmful/off-topic outputs | Guardrails AI, NeMo Guardrails |
| Observability | Track latency, cost, token usage | LangSmith, Langfuse |
| Vector DB | Store embeddings for RAG / agent memory | Pinecone, Weaviate |
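The Rate Limiting row names a token bucket. Here is a minimal in-memory sketch of the idea; a production version would keep the bucket state in a shared store such as Redis, which this example does not attempt, and the capacity and refill rate are arbitrary.

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1):
        """Spend `cost` tokens if available; return whether the request may proceed."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(capacity=2, refill_per_second=1)
print([bucket.allow() for _ in range(3)])
```

For LLM traffic, `cost` can be the request's token count rather than 1, so that large prompts drain the bucket faster.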
Latency Budget for an Agent Call
User request → API Gateway:    ~5ms
Gateway → LLM Proxy:           ~5ms
Prompt construction (RAG):     ~50-200ms (embed + vector search)
LLM First Token (TTFT):        ~500ms-2s
LLM Generation (streaming):    ~2-10s total
Tool execution (if any):       ~100ms-3s per tool
Total per agent step:          ~3-15 seconds
Multi-step agent (5 steps):    ~15-75 seconds
Design Insight: Stream responses to the user immediately after first token. Never make users wait for the full generation.
Scaling Considerations
Token throughput bottleneck → GPU VRAM is the constraint, not CPU
KV Cache size               → Grows with batch size × context length
Cost                        → On the order of $10-15 per 1M output tokens for GPT-4o-class hosted models (2025; list prices change often)
Latency vs throughput       → Continuous batching maximizes GPU utilization
Cold start                  → Large models take 10-30s to load; keep warm replicas
Key Trade-offs & Failure Modes
Trade-off Table
| Decision | You Gain | You Lose |
|---|---|---|
| Larger model | Better reasoning | Higher cost, latency |
| Longer context | More information | Quadratic attention cost |
| More agent steps | Deeper reasoning | More latency, cost, chance of error |
| Fine-tuning | Domain expertise | Expensive, staleness risk |
| Streaming | Perceived speed | Complex client handling |
| Self-hosted model | Cost control, privacy | Infra overhead |
Common Failure Modes
| Failure | Cause | Mitigation |
|---|---|---|
| Hallucination | Model generates confident falsehoods | RAG grounding, citations, self-check |
| Prompt Injection | Malicious input hijacks agent behavior | Input sanitization, sandboxing |
| Infinite Loop | Agent can't reach goal, loops forever | Max steps limit, loop detection |
| Tool Misuse | Wrong arguments passed to tool | Strict JSON schema validation |
| Context Overflow | Too many tool results fill context window | Summarize + truncate observations |
| Cascading Cost | Runaway agent burns token budget | Hard cost caps per session |
| Stale Knowledge | Training cutoff misses recent events | RAG with live data sources |
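Two of the mitigations in the table, loop detection and observation truncation, are simple enough to sketch directly. The window size, character limit, and the shape of the `history` entries are assumptions chosen for the example.

```python
import json

def is_looping(history, window=3):
    """True when the last `window` tool calls are identical (same tool, same arguments)."""
    if len(history) < window:
        return False
    recent = [json.dumps(call, sort_keys=True) for call in history[-window:]]
    return len(set(recent)) == 1

def truncate_observation(text, max_chars=500):
    """Keep the head of a long tool result and note how much was dropped."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + f"... [truncated {len(text) - max_chars} chars]"

history = [{"tool": "web_search", "args": {"query": "pricing"}} for _ in range(3)]
print(is_looping(history))                                # True
print(truncate_observation("x" * 520, max_chars=500)[-25:])
```

Plain truncation is the crudest option; summarizing the observation with a cheaper model preserves more signal at extra cost.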
Glossary
| Term | Definition |
|---|---|
| Token | Sub-word unit of text; ~0.75 words on average for English |
| Context Window | Max tokens the model can process at once |
| Temperature | Sampling randomness (0 = deterministic) |
| Top-P (nucleus) | Sample from the smallest set of top tokens whose cumulative probability reaches P |
| RLHF | Reinforcement Learning from Human Feedback |
| LoRA | Low-Rank Adaptation, an efficient fine-tuning method |
| KV Cache | Cached Key-Value tensors to speed up decoding |
| Embedding | High-dimensional vector representation of text |
| RAG | Retrieval-Augmented Generation |
| Chain-of-Thought | Prompting model to reason step-by-step |
| Function Calling | Structured LLM output invoking a defined tool schema |
| Agent Loop | Observe → Plan → Act → Observe cycle |
| Guardrails | Safety checks on model inputs/outputs |
| TTFT | Time to First Token, a key latency metric |
| Hallucination | Model generating confident but factually wrong content |
Related Topics: Vector Databases · Vector Embeddings · RAG Pattern · ANN & HNSW Indexing
