Skip to content

πŸ“ Similarity Metrics β€” Complete Guide ​

The "ruler" of vector space: Choosing the right way to measure distance.

NOTE

Prerequisite: This guide assumes you know what Vector Embeddings β†’ are.


Why Metric Choice Matters ​

In a vector database, every search is a math problem: "Find me the vectors closest to this query vector." But "close" can be defined in different ways. Choosing the wrong metric can lead to irrelevant search results, even if your embeddings are perfect.


1. Euclidean Distance (L2) ​

Euclidean Distance is the straight-line distance between two points in high-dimensional space. It is the most intuitive metric, based on the Pythagorean theorem.

The Math ​

For two vectors and :

Visualization ​

Think of it as using a literal ruler to measure the distance between two dots on a piece of paper (2D) or in a room (3D).

βœ… When to Use: ​

  • Image Search: L2 is often superior for visual features where absolute intensity/position matters.
  • Clustering: Algorithms like K-Means natively use L2 distance.
  • Physical Data: Sensor data, GPS coordinates, or any spatial data.

⚠️ The Catch: ​

It is sensitive to magnitude. If one vector is exactly like another but twice as "long" (e.g., a short document vs. a long document containing the same words twice), L2 will see them as very far apart.


2. Cosine Similarity ​

Cosine Similarity measures the angle between two vectors, regardless of their length (magnitude). It tells you how much two vectors "point" in the same direction.

The Math ​

It calculates the cosine of the angle between vectors and :

  • 1.0: Vectors point in the same direction (identical).
  • 0.0: Vectors are orthogonal (90Β°, unrelated).
  • -1.0: Vectors point in opposite directions.

Visualization ​

Vectors are treated as arrows starting from the origin. We only care about the gap between the arrows.

βœ… When to Use: ​

  • NLP & Text: The "gold standard" for text. It handles different document lengths gracefully (a 10-page article about cats is semantically identical to a 1-page article about cats).
  • Recommendation Systems: Comparing user profiles where one user might have 100 ratings and another has 5.

⚠️ The Catch: ​

It ignores magnitude entirely. If the "intensity" of the signal matters, Cosine might discard useful information.


3. Dot Product ​

Dot Product is a raw sum of multiplications. It combines both angle and magnitude.

The Math ​

If vectors are normalized (length = 1), Dot Product is mathematically identical to Cosine Similarity.

Visualization ​

It's essentially the projection of one vector onto another.

βœ… When to Use: ​

  • Maximum Inner Product Search (MIPS): Commonly used in deep learning and ranking.
  • Performance: It is the computationally cheapest metric to calculate (no square roots or divisions).
  • Ranking: When "popularity" or "importance" is baked into the vector length.

πŸ†š Side-by-Side Comparison ​

FeatureEuclidean (L2)Cosine SimilarityDot Product
FocusDistance (Gap)Direction (Angle)Direction + Magnitude
SensitivitySensitive to lengthMagnitude invariantHighly sensitive to length
Output Range to (Lower is better) to (Higher is better) to (Higher is better)
Best ForImages, Spatial dataText, DocumentsRecommendations, Ranking

πŸ› οΈ The "Magic" of Normalization ​

In production vector databases, you will often see people use Dot Product on Normalized vectors. Why?

  1. If you normalize your vectors (force their length to be exactly 1.0), then:
    • Cosine Similarity = Dot Product.
    • Euclidean Distance becomes monotonically related to Cosine Similarity.
  2. Result: You get the semantic accuracy of Cosine Similarity with the blazing hardware speed of Dot Product.

πŸ’» Implementation in JavaScript ​

javascript
/**
 * All-in-one Similarity Library
 */
const VectorMath = {
  // 1. Euclidean Distance
  euclidean: (a, b) => {
    let sum = 0;
    for (let i = 0; i < a.length; i++) {
      sum += Math.pow(a[i] - b[i], 2);
    }
    return Math.sqrt(sum);
  },

  // 2. Dot Product
  dot: (a, b) => {
    let sum = 0;
    for (let i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  },

  // 3. Cosine Similarity
  cosine: (a, b) => {
    const dot = VectorMath.dot(a, b);
    const magA = Math.sqrt(a.reduce((s, x) => s + x * x, 0));
    const magB = Math.sqrt(b.reduce((s, x) => s + x * x, 0));
    return dot / (magA * magB);
  },

  // Helper: Normalize a vector to length 1.0
  normalize: (v) => {
    const mag = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
    return v.map((x) => x / mag);
  },
};

// Usage
const v1 = [1, 2, 3];
const v2 = [2, 4, 6]; // v2 is v1 but 2x longer

console.log(VectorMath.cosine(v1, v2)); // 1.0 (Identical direction)
console.log(VectorMath.euclidean(v1, v2)); // 3.74 (Very different distance)

πŸ—ΊοΈ Decision Flowchart ​


βœ… Checklist Before Moving On ​

  • [ ] I understand that Cosine cares about direction, while L2 cares about position.
  • [ ] I know why text search usually defaults to Cosine.
  • [ ] I can explain why magnitude-sensitivity makes L2 tricky for documents of different lengths.
  • [ ] I understand that normalizing vectors makes Dot Product and Cosine effectively the same.

➑️ Next: ANN & HNSW Index β†’

Released under the ISC License.