Vector Embeddings: Complete Guide
Turning meaning into math: the foundation of modern AI systems.
What is a Vector Embedding?
A vector embedding is a numerical representation of real-world data (text, image, audio, video, code) as a fixed-size array of floating-point numbers.
The key insight: similar things produce similar numbers.
Word: "King"  → [0.81, 0.22, 0.67, 0.14, ...] ─┐
Word: "Queen" → [0.79, 0.25, 0.65, 0.19, ...] ─┘ nearby in space
Word: "Apple" → [0.12, 0.91, 0.03, 0.88, ...]    far away
The model doesn't just store the word; it encodes its meaning, relationships, and context into a point in high-dimensional space.
The Intuition: From Words to Numbers
Step 1: One-Hot Encoding (naive, broken)
The old way: represent each word as a sparse binary vector.
Vocabulary: [cat, dog, king, queen, apple]
"cat"   = [1, 0, 0, 0, 0]
"dog"   = [0, 1, 0, 0, 0]
"king"  = [0, 0, 1, 0, 0]
"queen" = [0, 0, 0, 1, 0]
"apple" = [0, 0, 0, 0, 1]
Problems:
- No relationship encoded: "king" and "queen" look as different as "cat" and "apple"
- Needs one dimension per vocabulary word, so a real vocabulary means vectors with up to millions of mostly-zero entries
- No semantic similarity captured
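To make the first problem concrete, here is a minimal sketch that builds one-hot vectors for the five-word vocabulary and compares them. `oneHot` and `dot` are helper names introduced only for this illustration; they do not come from any library.

```javascript
// Build a one-hot vector for `word` over a fixed vocabulary.
function oneHot(word, vocabulary) {
  return vocabulary.map((entry) => (entry === word ? 1 : 0));
}

// Dot product of two equal-length vectors.
function dot(a, b) {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

const vocabulary = ["cat", "dog", "king", "queen", "apple"];
const king = oneHot("king", vocabulary);
const queen = oneHot("queen", vocabulary);
const apple = oneHot("apple", vocabulary);

// Every one-hot vector has length 1, so the dot product is also the cosine similarity.
console.log(dot(king, queen)); // 0
console.log(dot(king, apple)); // 0
console.log(dot(king, king));  // 1
```

Any two different words score exactly 0, so the representation cannot say that "king" is closer to "queen" than to "apple".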
Step 2: Dense Embeddings (modern, powerful)
A neural network learns to compress meaning into a small dense vector:
"king"  → [0.81, 0.22, 0.67, 0.14, 0.55, ...] (300-1536 numbers)
"queen" → [0.79, 0.25, 0.65, 0.19, 0.53, ...] (very close!)
"apple" → [0.12, 0.91, 0.03, 0.88, 0.11, ...] (far away)
The famous analogy holds, at least approximately: King - Man + Woman ≈ Queen
Geometry of Embeddings
In reality, embeddings live in hundreds to thousands of dimensions; the geometry is the same, just much richer.
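The King - Man + Woman ≈ Queen analogy is plain vector arithmetic followed by a nearest-neighbour lookup. The sketch below shows the mechanics on hand-picked 3-dimensional toy vectors. The values are invented for illustration and do not come from a real model, and `analogy` and `nearestWord` are hypothetical helper names.

```javascript
// Toy 3-dimensional vectors, hand-picked for illustration (NOT from a real model).
const toyVectors = {
  king: [0.9, 0.8, 0.1],
  queen: [0.9, 0.1, 0.8],
  man: [0.1, 0.9, 0.1],
  woman: [0.1, 0.1, 0.9],
  apple: [0.05, 0.1, 0.1],
};

function cosine(a, b) {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const magB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dot / (magA * magB);
}

// Compute a - b + c element by element.
function analogy(a, b, c) {
  return a.map((ai, i) => ai - b[i] + c[i]);
}

// Return the word whose vector is most similar to `target`, skipping `exclude`.
function nearestWord(target, vectors, exclude = []) {
  let best = null;
  for (const [word, vec] of Object.entries(vectors)) {
    if (exclude.includes(word)) continue;
    const score = cosine(target, vec);
    if (best === null || score > best.score) best = { word, score };
  }
  return best;
}

const target = analogy(toyVectors.king, toyVectors.man, toyVectors.woman);
const answer = nearestWord(target, toyVectors, ["king", "man", "woman"]);
console.log(answer.word); // queen
```

Excluding the three input words from the lookup is the usual convention for analogy queries; with real embeddings the result is approximate and does not hold for every word pair.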
How Embeddings Are Created
Popular Embedding Models
| Model | Provider | Dimensions | Best For |
|---|---|---|---|
| text-embedding-ada-002 | OpenAI | 1536 | General text, RAG |
| text-embedding-3-small | OpenAI | 1536 | Cost-efficient text |
| text-embedding-3-large | OpenAI | 3072 | High-accuracy text |
| text-embedding-gecko | Google | 768 | Multilingual text |
| BERT / RoBERTa | HuggingFace | 768 | Open-source NLP |
| CLIP | OpenAI | 512 | Text + Image (multimodal) |
| Whisper embeddings | OpenAI | 1280 | Audio → vector |
| all-MiniLM-L6-v2 | SBERT | 384 | Fast, local, sentence similarity |
Similarity Metrics: How to Measure "Closeness"
Two vectors can be compared using different distance formulas depending on the use case.
1. Cosine Similarity
Measures the angle between two vectors. Ignores magnitude, cares about direction.
Range: -1 (opposite) → 0 (unrelated) → 1 (same direction)
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const magB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dot / (magA * magB);
}

const v1 = [0.21, 0.87, 0.43];
const v2 = [0.22, 0.85, 0.44];
const v3 = [0.91, 0.03, 0.72];
console.log(cosineSimilarity(v1, v2).toFixed(4)); // 0.9998 (very similar!)
console.log(cosineSimilarity(v1, v3).toFixed(4)); // 0.4571 (different)
Best for: Text similarity, semantic search, RAG
2. Euclidean Distance
Measures the straight-line distance between two points in space.
Lower = more similar
function euclideanDistance(a, b) {
  return Math.sqrt(a.reduce((sum, ai, i) => sum + Math.pow(ai - b[i], 2), 0));
}

console.log(euclideanDistance(v1, v2).toFixed(4)); // 0.0245 (very close!)
console.log(euclideanDistance(v1, v3).toFixed(4)); // 1.1312 (far apart)
Best for: Image embeddings, spatial data, clustering
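3. Dot Product
The sum of the element-wise products of two vectors. Higher = more similar, but unlike cosine similarity the result also grows with vector magnitude.
function dotProduct(a, b) {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}
Best for: Recommendation systems (when vectors are normalized)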
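The "when vectors are normalized" condition matters because the dot product of two unit-length vectors is exactly their cosine similarity. Here is a small sketch of that relationship; `normalize` and `magnitude` are helper names introduced for this example.

```javascript
function dotProduct(a, b) {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

function magnitude(v) {
  return Math.sqrt(dotProduct(v, v));
}

// Scale a vector to length 1 without changing its direction.
function normalize(v) {
  const m = magnitude(v);
  return v.map((x) => x / m);
}

function cosineSimilarity(a, b) {
  return dotProduct(a, b) / (magnitude(a) * magnitude(b));
}

const a = [0.21, 0.87, 0.43];
const b = [0.91, 0.03, 0.72];

const rawDot = dotProduct(a, b);
const unitDot = dotProduct(normalize(a), normalize(b));
const cos = cosineSimilarity(a, b);

console.log(rawDot.toFixed(4));  // 0.5268 (depends on magnitude)
console.log(unitDot.toFixed(4)); // 0.4571
console.log(cos.toFixed(4));     // 0.4571 (same as the normalized dot product)
```

If vectors are normalized once when they are stored, each later comparison needs only the cheaper dot product.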
Comparison
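One way to see how the three metrics differ is to compare a vector with a scaled copy of itself: the direction is identical, only the magnitude changes. This is a sketch with an arbitrary scale factor of 10; the helper names are local to the example.

```javascript
function dot(a, b) {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

function cosine(a, b) {
  return dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
}

function euclidean(a, b) {
  return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
}

const base = [0.21, 0.87, 0.43];
const longer = base.map((x) => x * 10); // same direction, 10x the magnitude

console.log(cosine(base, longer).toFixed(4));    // 1.0000 (direction only)
console.log(euclidean(base, longer).toFixed(2)); // 8.94 (far apart in space)
console.log(dot(base, base).toFixed(2));         // 0.99
console.log(dot(base, longer).toFixed(2));       // 9.86 (grows with magnitude)
```

Cosine similarity treats the two vectors as identical, while Euclidean distance and the dot product both react to the change in length. After normalizing to unit length, all three metrics rank neighbours in the same order.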
Anatomy of a Vector Embedding
Each number in the vector represents an abstract learned feature. Dimensions don't have human-readable labels; the neural network learns them during training.
Dimension 42 might encode "royalty" and dimension 107 might encode "gender", but we can't know for sure: the model decides, and a concept is often spread across many dimensions at once.
Generating Embeddings: Code Examples
Text Embedding (OpenAI)
// npm install openai
// ES module syntax (.mjs file or "type": "module"), which also allows top-level await
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Embed a single string and return its vector
async function embedText(text) {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding; // number[] with 1536 values
}

const v1 = await embedText("I love machine learning");
const v2 = await embedText("I enjoy deep learning");
const v3 = await embedText("The stock market crashed today");
// v1 and v2 will be close, v3 will be far from both
Batch Embedding (Multiple Texts at Once)
async function embedBatch(texts) {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts, // Pass the array directly: one request instead of many
  });
  return res.data.map((d) => d.embedding);
}

const sentences = [
  "How do I reset my password?",
  "I forgot my login credentials",
  "What is the refund policy?",
  "Can I get my money back?",
];
const vectors = await embedBatch(sentences);
// vectors[0] ≈ vectors[1] (both about password/login)
// vectors[2] ≈ vectors[3] (both about refunds)
Local Embedding (No API, Free)
// npm install @xenova/transformers
import { pipeline } from "@xenova/transformers";

const embedder = await pipeline(
  "feature-extraction",
  "Xenova/all-MiniLM-L6-v2"
);

async function embedLocal(text) {
  const output = await embedder(text, { pooling: "mean", normalize: true });
  return Array.from(output.data); // 384-dimensional vector
}

const vec = await embedLocal("Hello world");
console.log(vec.length); // 384
When to Use Embeddings
Use Embeddings When:
| Scenario | Why Embeddings Help |
|---|---|
| Semantic search | Find results by meaning, not just keywords |
| Chatbot / RAG | Retrieve relevant context for LLM answers |
| Recommendation | Suggest similar products / articles / songs |
| Clustering | Group documents by topic automatically |
| Zero-shot classification | Classify text without labeled training data |
| Duplicate detection | Find near-identical content across a large corpus |
| Cross-language search | Match Spanish query with English docs |
| Code search | Find function by describing what it does |
Don't Use Embeddings When:
- You only need exact keyword matching (use full-text search)
- You need structured query filters (price < 100, status = active)
- You have very limited compute, since embeddings add latency
- Your dataset is tiny (< 1000 items), where simpler methods work fine
Full Pipeline: From Raw Data to Search
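The flow from raw text to search results can be sketched in a few functions: embed every document once, keep the vectors in an index, then embed the query and rank by cosine similarity. To keep the sketch runnable without an API key it uses `makeToyEmbedder`, a hypothetical word-count stand-in that only matches exact words; in a real pipeline an embedding model would take its place, and a vector database would usually replace the in-memory array.

```javascript
// Stand-in for a real embedding model: counts words from a fixed vocabulary.
// It only matches exact words, so it shows the data flow, not semantic quality.
function makeToyEmbedder(corpus) {
  const tokenize = (text) => text.toLowerCase().match(/[a-z0-9]+/g) || [];
  const vocabulary = [...new Set(corpus.flatMap(tokenize))];
  return (text) => {
    const tokens = tokenize(text);
    return vocabulary.map((word) => tokens.filter((t) => t === word).length);
  };
}

function cosine(a, b) {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const magB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return magA > 0 && magB > 0 ? dot / (magA * magB) : 0; // guard against zero vectors
}

// Step 1 (index time): embed every document once and store the vector with the text.
function buildIndex(docs, embedFn) {
  return docs.map((text, id) => ({ id, text, vector: embedFn(text) }));
}

// Step 2 (query time): embed the query, score every document, return the top k.
function search(index, query, embedFn, k = 2) {
  const queryVector = embedFn(query);
  return index
    .map((doc) => ({ id: doc.id, text: doc.text, score: cosine(queryVector, doc.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const docs = [
  "How do I reset my password?",
  "What is the refund policy?",
  "Shipping takes three to five days",
];
const embed = makeToyEmbedder(docs);
const index = buildIndex(docs, embed);

const hits = search(index, "reset password", embed, 1);
console.log(hits[0].text); // How do I reset my password?
```

Scoring every document on each query is fine for small collections; larger ones typically rely on an approximate nearest-neighbour index instead of this linear scan.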
Real-World Example: Duplicate Ticket Detector
Problem
A support team receives thousands of tickets daily. Many are duplicates. Manual review is impossible.
Solution: Embedding-Based Deduplication
// ES module syntax, so the top-level await in the usage section works
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// ── Embed a support ticket ──
async function embed(text) {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}

// ── Cosine similarity helper ──
function similarity(a, b) {
  const dot = a.reduce((s, v, i) => s + v * b[i], 0);
  const mag = (v) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (mag(a) * mag(b));
}

// ── Check if a new ticket is a duplicate ──
// Returns the first stored ticket whose similarity reaches the threshold.
async function isDuplicate(newTicket, existingTickets, threshold = 0.92) {
  const newVec = await embed(newTicket.text);
  for (const existing of existingTickets) {
    const score = similarity(newVec, existing.vector);
    if (score >= threshold) {
      return {
        isDuplicate: true,
        matchId: existing.id,
        score: score.toFixed(4),
      };
    }
  }
  return { isDuplicate: false };
}

// ── Usage ──
const existingTickets = [
  {
    id: "T-001",
    text: "I cannot login to my account",
    vector: await embed("I cannot login to my account"),
  },
];
const newTicket = { text: "Unable to sign in to my profile" };
const result = await isDuplicate(newTicket, existingTickets);
console.log(result);
// Illustrative output (real scores depend on the model, so tune the threshold on your own tickets):
// { isDuplicate: true, matchId: "T-001", score: "0.9541" }
// Flagged as duplicate: same intent, different words!
Embedding Dimensions: Trade-offs
| Dimensions | Model Example | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| 384 | MiniLM-L6-v2 | ⚡⚡⚡ | ⭐⭐ | Real-time, edge, mobile |
| 768 | BERT, gecko | ⚡⚡ | ⭐⭐⭐ | General NLP |
| 1536 | ada-002, embed-3-sm | ⚡ | ⭐⭐⭐⭐ | Production RAG |
| 3072 | embed-3-large | 🐢 | ⭐⭐⭐⭐⭐ | High-stakes similarity |
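The dimension count is not always fixed by the model. OpenAI's text-embedding-3 models accept a `dimensions` request parameter that returns shorter vectors, and a vector that has already been stored can be shortened by keeping its leading values and re-normalizing. The sketch below shows only that manual step on a made-up vector; `shortenEmbedding` is a hypothetical helper name, and the approach assumes a model trained so that the leading dimensions carry the most information, which is not true of every model.

```javascript
// Keep the first `dims` values of a vector and rescale the result to length 1.
// Assumes the model tolerates truncation; check this for the model you use.
function shortenEmbedding(vector, dims) {
  if (!Number.isInteger(dims) || dims <= 0 || dims > vector.length) {
    throw new RangeError("dims must be an integer between 1 and vector.length");
  }
  const head = vector.slice(0, dims);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) return head; // nothing to rescale
  return head.map((x) => x / norm);
}

// Made-up values chosen so the arithmetic is easy to follow.
const full = [3, 4, 12, 1];
const short = shortenEmbedding(full, 2);
console.log(short); // [ 0.6, 0.8 ]
```

Re-normalizing matters because cosine and dot-product comparisons on stored vectors usually assume unit length. Whether the shorter vector is accurate enough is something to measure on your own data.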
Multimodal Embeddings: Same Space, Different Data
One of the most powerful ideas: embed text and images into the same vector space (e.g., OpenAI CLIP).
Use cases:
- Search images by typing a description
- Find images most similar to another image
- Auto-tagging / captioning images
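Because text and images share one space, auto-tagging reduces to a similarity lookup: embed a few candidate captions, embed the image, and keep the closest captions. The sketch below assumes both kinds of vector already exist. The numbers are invented 3-dimensional stand-ins for what a CLIP-style text encoder and image encoder would return, and `tagImage` is a hypothetical helper name.

```javascript
function cosine(a, b) {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const magB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dot / (magA * magB);
}

// Rank candidate captions against one image vector and keep the best `k`.
function tagImage(imageVector, labelVectors, k = 1) {
  return Object.entries(labelVectors)
    .map(([label, vector]) => ({ label, score: cosine(imageVector, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.label);
}

// Invented vectors: pretend these came from the text encoder.
const labelVectors = {
  "a photo of a dog": [0.9, 0.1, 0.2],
  "a photo of a cat": [0.2, 0.9, 0.1],
  "a photo of a car": [0.1, 0.2, 0.9],
};

// Invented vector: pretend this came from the image encoder.
const imageVector = [0.8, 0.3, 0.1];

console.log(tagImage(imageVector, labelVectors)); // [ 'a photo of a dog' ]
```

Swapping the roles, with one text vector ranked against many image vectors, gives the search-by-description use case with the same code.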
Embedding Quality Checklist
Chunking Strategy for Long Documents
// Long documents must be split before embedding.
// chunkSize and overlap are counted in words; overlap must be smaller than chunkSize.
function chunkText(text, chunkSize = 500, overlap = 50) {
  if (overlap >= chunkSize) {
    throw new RangeError("overlap must be smaller than chunkSize");
  }
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += chunkSize - overlap) {
    const end = Math.min(i + chunkSize, words.length); // exclusive end index
    chunks.push({ text: words.slice(i, end).join(" "), start: i, end });
    if (end === words.length) break; // last chunk reached
  }
  return chunks;
}

// Embed each chunk separately (uses the embed() helper from the ticket example)
async function embedDocument(fullText) {
  const chunks = chunkText(fullText, 500, 50);
  const results = [];
  for (const chunk of chunks) {
    const vector = await embed(chunk.text);
    results.push({ ...chunk, vector });
  }
  return results; // array of { text, vector, start, end }
}
Embedding Models Compared
| Criterion | text-embedding-3-small | all-MiniLM-L6-v2 | text-embedding-gecko |
|---|---|---|---|
| Provider | OpenAI (API) | HuggingFace (local) | Google (API) |
| Dimensions | 1536 | 384 | 768 |
| Cost | Paid | Free | Paid |
| Latency | ~200ms | ~5ms (local) | ~150ms |
| Multilingual | Yes | Partial | Yes |
| Best For | Production RAG | Local / Edge | Google Cloud stack |
Checklist Before Moving On
- [ ] I can explain what a vector embedding is in plain English
- [ ] I understand why embeddings capture meaning, not just words
- [ ] I know the difference between cosine, euclidean, and dot product
- [ ] I can generate embeddings using the OpenAI API in JavaScript
- [ ] I understand the chunking strategy for long documents
- [ ] I know which embedding model to pick for different scenarios
- [ ] I understand multimodal embeddings (CLIP)
Further Reading
- OpenAI Embeddings Guide: Official API docs
- Sentence Transformers: Widely used open-source embedding library
- The Illustrated Word2Vec: Visual intuition
- CLIP Paper (OpenAI): Multimodal embeddings
Next: ANN & HNSW Index → then Vector Databases →
