📐 Similarity Metrics — Complete Guide

The "ruler" of vector space: Choosing the right way to measure distance.

NOTE

Prerequisite: This guide assumes you know what Vector Embeddings → are.

Why Metric Choice Matters

In a vector database, every search is a math problem: "Find me the vectors closest to this query vector." But "close" can be defined in different ways. Choosing the wrong metric can lead to irrelevant search results, even if your embeddings are perfect.

1. Euclidean Distance (L2)

Euclidean Distance is the straight-line distance between two points in high-dimensional space. It is the most intuitive metric, based on the Pythagorean theorem.

The Math

For two vectors and :

Visualization

Think of it as using a literal ruler to measure the distance between two dots on a piece of paper (2D) or in a room (3D).

✅ When to Use:

Image Search: L2 is often superior for visual features where absolute intensity/position matters.
Clustering: Algorithms like K-Means natively use L2 distance.
Physical Data: Sensor data, GPS coordinates, or any spatial data.

⚠️ The Catch:

It is sensitive to magnitude. If one vector is exactly like another but twice as "long" (e.g., a short document vs. a long document containing the same words twice), L2 will see them as very far apart.

2. Cosine Similarity

Cosine Similarity measures the angle between two vectors, regardless of their length (magnitude). It tells you how much two vectors "point" in the same direction.

The Math

It calculates the cosine of the angle between vectors and :

1.0: Vectors point in the same direction (identical).
0.0: Vectors are orthogonal (90°, unrelated).
-1.0: Vectors point in opposite directions.

Visualization

Vectors are treated as arrows starting from the origin. We only care about the gap between the arrows.

✅ When to Use:

NLP & Text: The "gold standard" for text. It handles different document lengths gracefully (a 10-page article about cats is semantically identical to a 1-page article about cats).
Recommendation Systems: Comparing user profiles where one user might have 100 ratings and another has 5.

⚠️ The Catch:

It ignores magnitude entirely. If the "intensity" of the signal matters, Cosine might discard useful information.

3. Dot Product

Dot Product is a raw sum of multiplications. It combines both angle and magnitude.

The Math

If vectors are normalized (length = 1), Dot Product is mathematically identical to Cosine Similarity.

Visualization

It's essentially the projection of one vector onto another.

✅ When to Use:

Maximum Inner Product Search (MIPS): Commonly used in deep learning and ranking.
Performance: It is the computationally cheapest metric to calculate (no square roots or divisions).
Ranking: When "popularity" or "importance" is baked into the vector length.

🆚 Side-by-Side Comparison

Feature	Euclidean (L2)	Cosine Similarity	Dot Product
Focus	Distance (Gap)	Direction (Angle)	Direction + Magnitude
Sensitivity	Sensitive to length	Magnitude invariant	Highly sensitive to length
Output Range	to (Lower is better)	to (Higher is better)	to (Higher is better)
Best For	Images, Spatial data	Text, Documents	Recommendations, Ranking

🛠️ The "Magic" of Normalization

In production vector databases, you will often see people use Dot Product on Normalized vectors. Why?

If you normalize your vectors (force their length to be exactly 1.0), then:
- Cosine Similarity = Dot Product.
- Euclidean Distance becomes monotonically related to Cosine Similarity.
Result: You get the semantic accuracy of Cosine Similarity with the blazing hardware speed of Dot Product.

💻 Implementation in JavaScript

javascript

/**
 * All-in-one Similarity Library
 */
const VectorMath = {
  // 1. Euclidean Distance
  euclidean: (a, b) => {
    let sum = 0;
    for (let i = 0; i < a.length; i++) {
      sum += Math.pow(a[i] - b[i], 2);
    }
    return Math.sqrt(sum);
  },

  // 2. Dot Product
  dot: (a, b) => {
    let sum = 0;
    for (let i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  },

  // 3. Cosine Similarity
  cosine: (a, b) => {
    const dot = VectorMath.dot(a, b);
    const magA = Math.sqrt(a.reduce((s, x) => s + x * x, 0));
    const magB = Math.sqrt(b.reduce((s, x) => s + x * x, 0));
    return dot / (magA * magB);
  },

  // Helper: Normalize a vector to length 1.0
  normalize: (v) => {
    const mag = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
    return v.map((x) => x / mag);
  },
};

// Usage
const v1 = [1, 2, 3];
const v2 = [2, 4, 6]; // v2 is v1 but 2x longer

console.log(VectorMath.cosine(v1, v2)); // 1.0 (Identical direction)
console.log(VectorMath.euclidean(v1, v2)); // 3.74 (Very different distance)

🗺️ Decision Flowchart

✅ Checklist Before Moving On

[ ] I understand that Cosine cares about direction, while L2 cares about position.
[ ] I know why text search usually defaults to Cosine.
[ ] I can explain why magnitude-sensitivity makes L2 tricky for documents of different lengths.
[ ] I understand that normalizing vectors makes Dot Product and Cosine effectively the same.

➡️ Next: ANN & HNSW Index →

📐 Similarity Metrics — Complete Guide ​

Why Metric Choice Matters ​

1. Euclidean Distance (L2) ​

The Math ​

Visualization ​

✅ When to Use: ​

⚠️ The Catch: ​

2. Cosine Similarity ​

The Math ​

Visualization ​

✅ When to Use: ​

⚠️ The Catch: ​

3. Dot Product ​

The Math ​

Visualization ​

✅ When to Use: ​

🆚 Side-by-Side Comparison ​

🛠️ The "Magic" of Normalization ​

💻 Implementation in JavaScript ​

🗺️ Decision Flowchart ​

✅ Checklist Before Moving On ​

📐 Similarity Metrics — Complete Guide

Why Metric Choice Matters

1. Euclidean Distance (L2)

The Math

Visualization

✅ When to Use:

⚠️ The Catch:

2. Cosine Similarity

The Math

Visualization

✅ When to Use:

⚠️ The Catch:

3. Dot Product

The Math

Visualization

✅ When to Use:

🆚 Side-by-Side Comparison

🛠️ The "Magic" of Normalization

💻 Implementation in JavaScript

🗺️ Decision Flowchart

✅ Checklist Before Moving On