artificial-intelligence, openai-api, qdrantclient

Why is the score different for Manhattan distance vs. Cosine distance when the same text chunk is returned?


I am using the Qdrant DB and client to embed a document as part of a RAG proof of concept (PoC) that I am building.

When I build the vector collection with the Manhattan distance, I get a much higher score than when I use the Cosine distance, yet the text chunk returned is the same. I do not understand why. I am still learning the ropes with RAG. Thanks in advance.

USER QUERY

What is DoS?

COSINE DISTANCE

response: [
ScoredPoint(id=0, 
version=10, 
score=0.17464592, 
payload={
'chunk': "It also includes overhead bytes for operations, 
administration, and maintenance (OAM) purposes.\nOptical Network Unit 
(ONU)\nONU is a device used in Passive Optical Networks (PONs). It converts 
optical signals transmitted via fiber optic cables into electrical signals that 
can be used by end-user devices, such as computers and telephones. The ONU is 
located at the end user's premises and serves as the interface between the optical 
network and the user's local network."
}, 
vector=None, shard_key=None)
]

MANHATTAN DISTANCE

response: [
ScoredPoint(id=0, 
version=10, 
score=103.86209, 
payload={
'chunk': "It also includes overhead bytes for operations, administration, 
and maintenance (OAM) purposes.\nOptical Network Unit 
(ONU)\nONU is a device used in Passive Optical Networks (PONs). It converts 
optical signals transmitted via fiber optic cables into electrical signals that 
can be used by end-user devices, such as computers and telephones. The ONU is 
located at the end user's premises and serves as the interface between the optical 
network and the user's local network."
}, 
vector=None, shard_key=None)
]

Solution

  • There are many different math functions that can be used to calculate similarity between two embedding vectors:

    • Cosine distance,
    • Manhattan distance (L1 norm),
    • Euclidean distance (L2 norm),
    • Dot product,
    • etc.

    Each calculates similarity in a different way, where:

    • The Cosine similarity is the cosine of the angle between two non-zero vectors, and the Cosine distance is 1 minus that value. It depends only on the direction of the vectors, not on their magnitude.

    [Image: Cosine distance]

    • The Manhattan distance is the sum of the absolute differences between the corresponding elements of two vectors. It is sensitive to the magnitude of the vectors.

    [Image: Manhattan distance]

    • The Euclidean distance measures the straight-line distance between two vectors. Like the Manhattan distance, it is sensitive to magnitude.

    [Image: Euclidean distance]

    • The Dot product is the cosine of the angle between two vectors multiplied by the product of their magnitudes.

    [Image: Dot product]

    Note: Image source for all four images
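The four measures described above can be sketched in a few lines of plain Python. This is only an illustration using the standard textbook definitions; the function names and the example vectors `a` and `b` are made up for this sketch and are not part of the Qdrant client.

```python
import math

def dot(a, b):
    # Sum of element-wise products.
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean length (magnitude) of a vector.
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    # Cosine of the angle between a and b; undefined for zero vectors.
    return dot(a, b) / (norm(a) * norm(b))

def cosine_distance(a, b):
    # 0 for the same direction, 2 for opposite directions.
    return 1.0 - cosine_similarity(a, b)

def manhattan_distance(a, b):
    # Sum of absolute element-wise differences (L1).
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean_distance(a, b):
    # Square root of the sum of squared differences (L2).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]  # same direction as a, twice the magnitude

print(cosine_distance(a, b))     # 0.0
print(manhattan_distance(a, b))  # 5.0
print(euclidean_distance(a, b))  # 3.0
print(dot(a, b))                 # 18.0
```

Because `b` points in the same direction as `a`, the Cosine distance is 0 even though the Manhattan and Euclidean distances are clearly non-zero. The numbers live on different scales, so they cannot be compared across measures.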

    Consequently, the results of similarity calculations are different, where:

    • The Cosine distance is always in the range [0, 2].
    • The Manhattan distance is always in the range [0, ∞).
    • The Euclidean distance is always in the range [0, ∞).
    • The Dot product is always in the range (-∞, ∞).

    See the table below.

    Measure            | Range   | Interpretation
    Cosine distance    | [0, 2]  | 0 if the vectors point in the same direction, 2 if they are diametrically opposite.
    Manhattan distance | [0, ∞)  | 0 if the vectors are the same, increases with the sum of absolute differences.
    Euclidean distance | [0, ∞)  | 0 if the vectors are the same, increases with the square root of the sum of squared differences.
    Dot product        | (-∞, ∞) | Measures alignment, can be positive, negative, or zero based on vector direction and magnitude.
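To connect this back to the question, here is a small sketch of how the same chunk can rank first under both measures while the scores look nothing alike. The vectors and chunk names are invented toy data, not the embeddings from the question. It also assumes that a cosine score is read as "higher is better" and a Manhattan score as "lower is better", which, as far as I know, is how Qdrant treats them, but that is worth confirming in the Qdrant documentation.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def scale(v, k):
    # Multiply every component by k (changes magnitude, not direction).
    return [k * x for x in v]

# Hypothetical toy "embeddings".
query = [3.0, 4.0]
chunks = {
    "chunk_a": [6.0, 9.0],
    "chunk_b": [-10.0, 60.0],
    "chunk_c": [50.0, -5.0],
}

# Cosine: the highest similarity wins. Manhattan: the lowest distance wins.
cos_top = max(chunks, key=lambda name: cosine_similarity(query, chunks[name]))
man_top = min(chunks, key=lambda name: manhattan_distance(query, chunks[name]))

print(cos_top, cosine_similarity(query, chunks[cos_top]))
print(man_top, manhattan_distance(query, chunks[man_top]))

# Scaling every vector by 10 leaves the cosine score essentially unchanged,
# but multiplies the Manhattan score by 10.
scaled_query = scale(query, 10.0)
scaled_chunks = {name: scale(v, 10.0) for name, v in chunks.items()}
print(cosine_similarity(scaled_query, scaled_chunks["chunk_a"]))
print(manhattan_distance(scaled_query, scaled_chunks["chunk_a"]))
```

In this toy data `chunk_a` wins under both measures, with a cosine score just below 1 and a Manhattan score of 8.0 (80.0 after scaling). The two measures are not guaranteed to agree on the top result in general. The point is only that a score is meaningful relative to other scores from the same measure, never across measures.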