
🤖 LLMs & AI Agents: Full Deep Dive

Large Language Models and AI Agents are reshaping the way we build intelligent, autonomous systems. This guide explains what they are under the hood, how they work, and how to architect production systems around them, from a system design perspective.


What is a Large Language Model (LLM)?

A Large Language Model is a deep neural network trained on massive text corpora to predict the next token (word-piece) in a sequence. Through this deceptively simple training objective, the model learns grammar, facts, reasoning, code, and even emergent capabilities like analogical reasoning.

Key Properties

| Property | Description |
|---|---|
| Parameters | Billions of learned weights (Llama 3 70B has 70 billion; GPT-4 is unofficially estimated at ~1.8T) |
| Context Window | Maximum tokens the model can "see" at once (e.g. 128K for GPT-4o) |
| Tokenization | Text split into sub-word tokens via BPE / SentencePiece |
| Temperature | Controls randomness in sampling (0 = greedy and near-deterministic, 1+ = more varied) |
| Emergent Behavior | Capabilities not explicitly trained for that appear at scale thresholds |
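
To make the Temperature row concrete, here is a minimal sketch of temperature-scaled softmax. The function name and the example logits are illustrative assumptions, not taken from any particular library; real inference stacks typically treat temperature 0 as a plain argmax rather than dividing by zero.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as argmax instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token; higher temperatures spread it across the alternatives.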

Training Phases

Phase 1: Pre-training
  ├── Dataset: Trillions of tokens (web, books, code, papers)
  ├── Objective: Next-token prediction (Causal LM)
  └── Cost: Millions of $ in GPU compute

Phase 2: Supervised Fine-tuning (SFT)
  ├── Dataset: High-quality instruction-response pairs
  ├── Objective: Follow human instructions accurately
  └── Teaches: Format, tone, task following

Phase 3: RLHF (Reinforcement Learning from Human Feedback)
  ├── Reward Model: Humans rank model responses
  ├── RL Loop: PPO algorithm maximizes reward
  └── Result: Aligned, helpful, safe outputs

How LLMs Work Internally

Tokenization

Before any computation, text is split into tokens: sub-word units that balance vocabulary size with coverage.

Input:  "System Design"
Tokens: ["System", " Design"]   ← whitespace is part of the token
IDs:    [id_1, id_2]            ← one integer ID per token (tokenizer-specific), used to index the embedding table
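
The split above can be imitated with a toy tokenizer. This sketch uses greedy longest-match over a tiny hand-written vocabulary, which is a simplification and not the BPE merge procedure itself; the vocabulary and its IDs are made up for illustration.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of entries.
VOCAB = {"System": 0, " Design": 1, " ": 2, "Sys": 3, "tem": 4, "D": 5, "esign": 6}

def tokenize(text, vocab):
    """Greedy longest-match: at each position take the longest piece found in vocab."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary entry starts at this position.
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return tokens

tokens = tokenize("System Design", VOCAB)
ids = [VOCAB[t] for t in tokens]  # IDs are what the embedding table is indexed with
print(tokens, ids)
```

Note that the leading space stays attached to " Design", matching the example above.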

Embeddings

Each token ID is converted to a high-dimensional vector (e.g. 4096 dimensions for a 7B model). The model learns that tokens with similar meanings have similar vector directions.

"king" - "man" + "woman" ≈ "queen"   ← classic word2vec result

Positional Encoding

Transformers have no inherent sense of order, so position must be injected explicitly. The original Transformer adds fixed sinusoidal encodings to the embeddings; most modern LLMs instead use Rotary Position Embeddings (RoPE), which are not learned and rotate the query and key vectors by a position-dependent angle inside attention.
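
As a rough illustration of the fixed sinusoidal scheme from the original Transformer (not RoPE), the sketch below computes the encoding for one position. The function name is an assumption; the base of 10000 follows the original paper.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    encoding = []
    for i in range(0, d_model, 2):
        # Each sin/cos pair shares one frequency; frequency falls as i grows.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        if i + 1 < d_model:
            encoding.append(math.cos(angle))
    return encoding

# The vector for each position is added element-wise to that token's embedding.
print([round(v, 3) for v in sinusoidal_encoding(0, 8)])
print([round(v, 3) for v in sinusoidal_encoding(1, 8)])
```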


Transformer Architecture

The Transformer is the foundation of nearly every modern LLM.

Self-Attention (The Core Mechanism)

Self-attention lets each token attend to other tokens in the context (in decoder-only LLMs, a causal mask restricts this to the token itself and earlier tokens). For each token, three vectors are computed:

| Vector | Purpose | Analogy |
|---|---|---|
| Query (Q) | What I am looking for | A search query |
| Key (K) | What I represent | A document label |
| Value (V) | What information I carry | The document content |

Attention Score Formula:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

  • QKᵀ: dot products measure how relevant each token is to every other token
  • √d_k: scaling keeps dot products from growing with dimension, which would saturate the softmax and shrink its gradients
  • softmax: normalizes scores to a probability distribution
  • Multiply by V: weighted sum of values based on relevance
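
The four steps above can be sketched in plain Python for a single attention head. This is a teaching sketch under simplifying assumptions (no batching, no learned projections, lists instead of tensors); the optional `causal` flag is an addition that masks out later positions, as decoder-only models do.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    output = []
    for i, q in enumerate(Q):
        # Step 1 and 2: dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Hide positions after i so a token cannot attend to the future.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        # Step 3: normalize to attention weights.
        weights = softmax(scores)
        # Step 4: weighted sum of the value vectors.
        output.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return output

Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V, causal=True))
```

With the causal mask, the first token can only see itself, so its output row equals the first value vector.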

Multi-Head Attention

Instead of one attention pass, the model uses H parallel heads, each learning different relationships (syntax, coreference, semantics). Outputs are concatenated and projected.

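Feed-Forward Network (FFN)

After attention, each token passes independently through a two-layer MLP with a GELU/SiLU activation. Interpretability work suggests that much of the model's factual knowledge is stored in these layers.

FFN(x) = GELU(xW₁ + b₁)W₂ + b₂

The FFN's inner dimension is typically ~4x the model's hidden dimension (gated SwiGLU variants use a smaller ratio with three weight matrices), so the FFN layers hold the majority of a dense model's parameters.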
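
A quick back-of-the-envelope count shows why the FFN dominates. The sketch assumes a standard block with four d x d attention projections (Q, K, V, output) and a two-matrix FFN with a 4x inner dimension, and it ignores biases, layer norms, and embeddings.

```python
def block_params(d_model, ffn_mult=4):
    """Approximate weight counts for one Transformer block (biases ignored)."""
    attention = 4 * d_model * d_model          # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * (ffn_mult * d_model)   # W_1 (d -> 4d) and W_2 (4d -> d)
    return attention, ffn

attention, ffn = block_params(4096)
share = ffn / (attention + ffn)
print(f"attention={attention:,} ffn={ffn:,} ffn share={share:.2f}")
```

Under these assumptions the FFN holds about two thirds of each block's weights, regardless of the hidden dimension.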


LLM Inference Pipeline

Two Phases of Inference

1. Prefill Phase: Process the entire prompt in one forward pass (parallel). Expensive for long contexts.

2. Decode Phase: Generate one token at a time, autoregressively. Each step re-uses cached K/V tensors.

KV Cache

The Key-Value Cache stores the key and value tensors of already-processed tokens so they don't have to be recomputed on every decoding step. It is the primary memory bottleneck in LLM serving.

Without KV Cache: O(n²) attention recomputation per token
With KV Cache:    O(n)  just compute new token's attention

System Design Note: KV Cache grows linearly with context length × batch size. At 128K context with large batches, it can consume 100s of GB of GPU VRAM.
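
The linear growth can be checked with simple arithmetic. The sketch below assumes an illustrative 7B-class shape (32 layers, 32 KV heads, head dimension 128, 16-bit values, no grouped-query attention); these numbers are assumptions for the example, not measurements of any specific deployment.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer, head, and token."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

GIB = 1024 ** 3
small = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
large = kv_cache_bytes(32, 32, 128, seq_len=128 * 1024, batch_size=8)
print(f"4K context, batch 1:   {small / GIB:.0f} GiB")
print(f"128K context, batch 8: {large / GIB:.0f} GiB")
```

Grouped-query attention reduces `n_kv_heads` and quantized caches reduce `bytes_per_value`, which is why both are common serving optimizations.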


Prompt Engineering Fundamentals

Prompt engineering is the practice of structuring inputs to get reliable, high-quality outputs from LLMs without changing model weights.

Core Techniques

| Technique | Description | When to Use |
|---|---|---|
| Zero-shot | Task description only, no examples | Simple, well-defined tasks |
| Few-shot | 2-5 examples in the prompt | Improve format / style consistency |
| Chain-of-Thought (CoT) | "Think step by step" reasoning | Math, logic, multi-step problems |
| Self-Consistency | Sample many CoT paths, majority vote | High-stakes reasoning |
| ReAct | Interleave Reasoning + Acting steps | Agent tool use |
| Tree-of-Thought (ToT) | Explore branching reasoning trees | Complex planning |

Chain-of-Thought Example

Without CoT:
  Prompt:  "Roger has 5 balls. He buys 2 more cans of 3 balls each. How many balls does he have now?"
  Output:  "11"  ← bare answer: error-prone on multi-step problems and not auditable

With CoT:
  Prompt:  "Think step by step. Roger has 5 balls..."
  Output:  "Roger starts with 5 balls. He buys 2 cans × 3 balls = 6 balls.
            5 + 6 = 11 balls."  ← correct + auditable
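
Self-Consistency from the table above builds directly on CoT: sample several reasoning paths and keep the most common final answer. In this sketch the completions are hard-coded stand-ins for sampled model outputs, and taking the last integer in each completion as its answer is a simplifying assumption.

```python
import re
from collections import Counter

def extract_answer(completion):
    """Treat the last integer in a completion as its final answer."""
    numbers = re.findall(r"-?\d+", completion)
    return numbers[-1] if numbers else None

def self_consistent_answer(completions):
    """Majority vote over the final answers of several reasoning paths."""
    answers = [extract_answer(c) for c in completions]
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Stand-ins for three completions sampled at temperature > 0.
samples = [
    "2 cans x 3 balls = 6. 5 + 6 = 11",
    "5 + 2 = 7, then 7 + 3 = 10",
    "He buys 6 balls, so 5 + 6 = 11",
]
print(self_consistent_answer(samples))
```

One faulty reasoning path is outvoted by the two that agree, at the cost of several model calls per question.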

What is an AI Agent?

An AI Agent is a system that uses an LLM as its reasoning core to perceive its environment, plan actions, execute tools, and iterate until a goal is achieved, without step-by-step human guidance.

The key difference from a plain LLM call:

| Plain LLM | AI Agent |
|---|---|
| One-shot prompt → response | Multi-step reasoning loop |
| No external tools | Uses tools (APIs, databases, browsers) |
| No memory between calls | Maintains working memory |
| Stateless | Stateful across steps |
| Fixed output | Goal-directed behavior |

The OODA Loop for AI Agents

Observe → Orient → Decide → Act → (loop back to Observe)

AI Agent Architecture

Core Components

| Component | Role | Examples |
|---|---|---|
| LLM Core | Reasoning, planning, language generation | GPT-4, Claude, Gemini, Llama |
| Planner | Decompose goal into sub-tasks | ReAct, Plan-and-Solve, AutoGPT |
| Tool Registry | Catalogue of available tools + schemas | OpenAI Function Calling, LangChain Tools |
| Memory | Store intermediate results, past context | Vector DB, Redis, in-context |
| Executor | Call tools and return observations | Python sandbox, API client |
| Evaluator | Determine if goal is reached | LLM-as-judge, rule-based check |

Agent Patterns & Topologies

1. ReAct Pattern (Reason + Act)

The most widely used single-agent pattern. The model interleaves thoughts with tool calls.

Thought: I need to find the capital of France.
Action: web_search("capital of France")
Observation: "Paris is the capital of France."
Thought: I now know the answer.
Final Answer: The capital of France is Paris.
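
The trace above can be driven by a small control loop. In this sketch `scripted_llm` is a hypothetical stand-in that replays the trace instead of calling a real model, `web_search` is a stub with a canned result, and the `Action: name("arg")` text format is an assumption chosen to match the example.

```python
import re

def web_search(query):
    # Stub tool with a canned result; a real tool would call a search API.
    return "Paris is the capital of France."

TOOLS = {"web_search": web_search}

def scripted_llm(transcript):
    # Stand-in for a model call: acts first, answers once an observation exists.
    if "Observation:" not in transcript:
        return ('Thought: I need to find the capital of France.\n'
                'Action: web_search("capital of France")')
    return "Thought: I now know the answer.\nFinal Answer: The capital of France is Paris."

ACTION_RE = re.compile(r'Action: (\w+)\("(.*)"\)')

def react(question, llm, tools, max_steps=5):
    """Alternate model replies and tool observations until a final answer appears."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += "\n" + reply
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        match = ACTION_RE.search(reply)
        if match is None:
            raise ValueError("reply contained neither an action nor a final answer")
        name, argument = match.groups()
        observation = tools[name](argument)  # the host executes the tool, not the model
        transcript += f"\nObservation: {observation}"
    raise RuntimeError("step budget exhausted without a final answer")

print(react("What is the capital of France?", scripted_llm, TOOLS))
```

The `max_steps` bound matters in practice: without it, a model that never emits a final answer would loop indefinitely.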

2. Plan-and-Execute Pattern

Separate planning and execution phases. Better for long-horizon tasks.

3. Reflexion Pattern

Agent reflects on its failures and tries again with improved strategies.

Attempt 1 → Failure
Reflection: "I searched too broadly. I should narrow the query."
Attempt 2 → Partial Success
Reflection: "Missing the pricing data. I should query a different source."
Attempt 3 → Success

4. Multi-Agent Supervisor Pattern

A supervisor agent delegates sub-tasks to specialized worker agents.


Memory Systems in Agents

Agents need memory to maintain context, learn from past steps, and avoid repeating mistakes.

Four Types of Agent Memory

In-Context (Working Memory)
  • Everything currently in the LLM's context window
  • Fast, immediate, limited by token window
  • Lost when session ends

External Short-Term (Episodic)
  • Recent observations, tool outputs, chat history
  • Stored in Redis or a DB, retrieved per session
  • Survives across API calls

Semantic Memory (Vectorized Knowledge)
  • Encoded facts, documents, embeddings
  • Stored in Vector DB (Pinecone, Weaviate, FAISS)
  • Retrieved via similarity search (RAG)

Procedural Memory (Skills / Fine-tuned Weights)
  • How to perform tasks, baked into model weights
  • Updated via fine-tuning / LoRA
  • Permanent, expensive to update

Memory Retrieval Flow (RAG-style)
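
A minimal sketch of a RAG-style retrieval flow: embed the query, rank stored memories by similarity, and place the top matches in the prompt. The bag-of-words `embed` function is a deliberately crude, hypothetical stand-in for a real embedding model, and the in-memory list stands in for a vector database.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy "embedding": word counts. A real system would call an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, memory, k=2):
    """Return the k stored texts most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(memory, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(query, memory, k=2):
    """Prepend retrieved memories to the question as grounding context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, memory, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

MEMORY = [
    "KV cache stores keys and values",
    "Paris is the capital of France.",
    "Redis is an in-memory store",
]
print(build_prompt("What is the capital of France?", MEMORY, k=1))
```

In production the ranking step is an approximate nearest-neighbor search inside the vector store rather than a full sort over every document.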


Tool Use & Function Calling

Tools are the "hands" of an AI Agent: they let the LLM interact with the real world.

How Function Calling Works (OpenAI-style)

json
{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": { "type": "string" },
            "units": { "type": "string", "enum": ["celsius", "fahrenheit"] }
          },
          "required": ["city"]
        }
      }
    }
  ]
}

The LLM outputs a structured JSON tool call; the host application executes it and feeds the result back as an observation.
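
On the host side, that step might look like the sketch below. The tool-call shape (`name` plus an `arguments` object) is a simplified, hypothetical format; real provider APIs differ in detail (for example, arguments often arrive as a JSON-encoded string). `get_weather` is a stub returning canned data, and the hand-rolled checks stand in for a full JSON Schema validator.

```python
import json

def get_weather(city, units="celsius"):
    # Stub implementation with canned data; a real tool would call a weather API.
    return {"city": city, "temperature": 21, "units": units}

# Registry mirroring the schema above: required keys, allowed keys, enum values.
TOOLS = {
    "get_weather": {
        "fn": get_weather,
        "required": ["city"],
        "allowed": {"city", "units"},
        "enums": {"units": ["celsius", "fahrenheit"]},
    }
}

def dispatch(tool_call_json):
    """Validate a model-emitted tool call, then execute it on the host."""
    call = json.loads(tool_call_json)
    spec = TOOLS.get(call.get("name"))
    if spec is None:
        raise ValueError(f"unknown tool: {call.get('name')!r}")
    args = call.get("arguments", {})
    missing = [key for key in spec["required"] if key not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    unexpected = set(args) - spec["allowed"]
    if unexpected:
        raise ValueError(f"unexpected arguments: {sorted(unexpected)}")
    for key, allowed in spec["enums"].items():
        if key in args and args[key] not in allowed:
            raise ValueError(f"{key} must be one of {allowed}")
    return spec["fn"](**args)

observation = dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}')
print(json.dumps(observation))  # serialized and fed back to the model as the observation
```

Validating before executing keeps malformed or hallucinated arguments from reaching the tool.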

Common Tool Categories

| Category | Examples | Use Case |
|---|---|---|
| Search | Tavily, Bing API, SerpAPI | Real-time web information |
| Code Execution | Python REPL, E2B sandbox | Computation, data analysis |
| Database | SQL query, vector search | Structured data retrieval |
| File I/O | Read/write files, parse PDFs | Document processing |
| Browser | Playwright, Puppeteer | Web scraping, UI automation |
| APIs | REST calls, GraphQL | Third-party services |
| Communication | Send email, Slack, SMS | Notifications, alerts |

Multi-Agent Systems

Complex goals often call for multiple specialized agents collaborating.

Design Principles for Multi-Agent Systems

  1. Single Responsibility: Each agent has one clear job
  2. Shared Memory: Agents communicate via a shared state store, not direct calls
  3. Idempotent Tools: Tools should be safe to retry on failure
  4. Human-in-the-Loop: Critical decisions require human confirmation
  5. Timeouts & Budgets: Set max steps, time limits, and token budgets per agent

RAG vs Fine-tuning vs Agents

Three ways to augment LLM capability, each with different trade-offs.

| Approach | What It Is | Pros | Cons |
|---|---|---|---|
| RAG | Retrieve relevant docs at query time | Always fresh data, cheap, auditable | Retrieval quality is a bottleneck |
| Fine-tuning | Re-train model on domain data | Deep knowledge, fast inference | Expensive, data goes stale |
| Agents | LLM calls tools dynamically | Flexible, real-time, general | Latency, cost, complex to debug |
| RAG + Agents | Agents retrieve + reason | Best of both worlds | Highest complexity |

Decision Matrix

Is the knowledge static and domain-specific?
  └─ YES → Fine-tuning OR RAG
       ├── Small dataset, need low latency → Fine-tuning
       └── Large/changing documents → RAG

Does the task require real-time data or external actions?
  └─ YES → Agents with tool use

Is the task multi-step with ambiguous sub-goals?
  └─ YES → Agents (ReAct or Plan-and-Execute)

Need low cost + simple Q&A over documents?
  └─ YES → RAG alone (no agents)

Production System Design for LLMs

High-Level Architecture

Key Infrastructure Components

| Component | Purpose | Tools |
|---|---|---|
| LLM Proxy | Route to best model, handle fallback | LiteLLM, OpenRouter |
| Semantic Cache | Cache similar queries by embedding similarity | GPTCache, Redis |
| Rate Limiting | Per-user token/request limits | Redis token bucket |
| Guardrails | Detect harmful/off-topic outputs | Guardrails AI, NeMo Guardrails |
| Observability | Track latency, cost, token usage | LangSmith, Langfuse |
| Vector DB | Store embeddings for RAG / agent memory | Pinecone, Weaviate |
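
The Rate Limiting row refers to the token bucket algorithm. Below is a single-process sketch of the idea; a production version would keep the bucket state in Redis so that all gateway instances share it. The class name and the injectable clock parameter are illustrative choices.

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, now=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.now = now                  # injectable clock, handy for testing
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = now()

    def allow(self, cost=1):
        """Spend `cost` tokens if available; `cost` can be an LLM token count."""
        current = self.now()
        elapsed = current - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = current
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# One bucket per user: 2 requests of burst, refilled at 1 request per second.
bucket = TokenBucket(capacity=2, refill_per_second=1.0)
print([bucket.allow() for _ in range(3)])
```

Passing the request's estimated token count as `cost` turns the same structure into a per-user token budget rather than a request counter.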

Latency Budget for an Agent Call

User request → API Gateway:        ~5ms
Gateway → LLM Proxy:               ~5ms
Prompt construction (RAG):         ~50-200ms  (embed + vector search)
LLM First Token (TTFT):            ~500ms-2s
LLM Generation (streaming):        ~2-10s total
Tool execution (if any):           ~100ms-3s per tool
Total per agent step:              ~3-15 seconds
Multi-step agent (5 steps):        ~15-75 seconds

Design Insight: Stream responses to the user as soon as the first token arrives instead of making them wait for the full generation.

Scaling Considerations

Token throughput bottleneck → GPU memory (VRAM capacity and bandwidth) is the constraint, not CPU
KV Cache size              → Grows with batch size × context length
Cost                       → ~$10/1M output tokens (GPT-4o list price, 2025; check current pricing)
Latency vs throughput      → Continuous batching maximizes GPU utilization
Cold start                 → Large models take 10-30s to load; keep warm replicas

Key Trade-offs & Failure Modes

Trade-off Table

| Decision | You Gain | You Lose |
|---|---|---|
| Larger model | Better reasoning | Higher cost, latency |
| Longer context | More information | Quadratic attention cost |
| More agent steps | Deeper reasoning | More latency, cost, chance of error |
| Fine-tuning | Domain expertise | Expensive, staleness risk |
| Streaming | Perceived speed | Complex client handling |
| Self-hosted model | Cost control, privacy | Infra overhead |

Common Failure Modes

| Failure | Cause | Mitigation |
|---|---|---|
| Hallucination | Model generates confident falsehoods | RAG grounding, citations, self-check |
| Prompt Injection | Malicious input hijacks agent behavior | Input sanitization, sandboxing |
| Infinite Loop | Agent can't reach goal, loops forever | Max steps limit, loop detection |
| Tool Misuse | Wrong arguments passed to tool | Strict JSON schema validation |
| Context Overflow | Too many tool results fill context window | Summarize + truncate observations |
| Cascading Cost | Runaway agent burns token budget | Hard cost caps per session |
| Stale Knowledge | Training cutoff misses recent events | RAG with live data sources |
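
The Infinite Loop and Cascading Cost mitigations can share one small guard that the agent loop calls before every step. This is a sketch under assumed names (`RunGuard`, `BudgetExceeded`) and thresholds; the loop detector here only catches the simplest case of the same action repeated back to back.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent run hits its step, cost, or repetition limit."""

class RunGuard:
    def __init__(self, max_steps, max_cost_usd, repeat_limit=3):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.repeat_limit = repeat_limit
        self.steps = 0
        self.cost_usd = 0.0
        self.recent_actions = []

    def check(self, action, step_cost_usd):
        """Record one planned step and abort the run if any limit is crossed."""
        self.steps += 1
        self.cost_usd += step_cost_usd
        if self.steps > self.max_steps:
            raise BudgetExceeded("max steps reached")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded("session cost cap reached")
        self.recent_actions.append(action)
        tail = self.recent_actions[-self.repeat_limit:]
        if len(tail) == self.repeat_limit and len(set(tail)) == 1:
            raise BudgetExceeded(f"action repeated {self.repeat_limit} times: {action!r}")

guard = RunGuard(max_steps=10, max_cost_usd=1.00)
try:
    for _ in range(5):
        guard.check('web_search("pricing")', step_cost_usd=0.02)
except BudgetExceeded as error:
    print(f"run aborted: {error}")
```

Raising a dedicated exception lets the caller distinguish a deliberate abort from a tool or model failure and report it to the user accordingly.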

Glossary

| Term | Definition |
|---|---|
| Token | Sub-word unit of text; ~0.75 English words on average |
| Context Window | Max tokens the model can process at once |
| Temperature | Sampling randomness (0 = deterministic) |
| Top-P (nucleus) | Sample from the smallest set of top tokens whose cumulative probability reaches P |
| RLHF | Reinforcement Learning from Human Feedback |
| LoRA | Low-Rank Adaptation, an efficient fine-tuning method |
| KV Cache | Cached Key-Value tensors to speed up decoding |
| Embedding | High-dimensional vector representation of text |
| RAG | Retrieval-Augmented Generation |
| Chain-of-Thought | Prompting model to reason step-by-step |
| Function Calling | Structured LLM output invoking a defined tool schema |
| Agent Loop | Observe → Plan → Act → Observe cycle |
| Guardrails | Safety checks on model inputs/outputs |
| TTFT | Time to First Token, a key latency metric |
| Hallucination | Model generating confident but factually wrong content |

Related Topics: Vector Databases · Vector Embeddings · RAG Pattern · ANN & HNSW Indexing



Released under the ISC License.