Dense Vector Search with Solr 9.4 - Incorrect dot product and cosine values returned by the knn search


I am experimenting with Dense Vector Search in Solr 9.4, but the knn search returns unexpected dot product values.

Here is a basic example:

  1. store a vector [0.57735027,0.57735027,0.57735027] in a collection
  2. perform a knn query with the query vector [0.26726124, 0.53452248, 0.80178373]

The dot product should be 0.92582, but the returned score is 0.96291006.

The weirdest part is that when I use a streaming expression with dotProduct(array(0.57735027,0.57735027,0.57735027),array(0.26726124,0.53452248,0.80178373)), Solr returns the right value: 0.92582.

Any idea why there is such a difference, and how I can obtain the right dot product from the knn search?

Steps to reproduce

Start a local Solr

Here is my docker-compose.yaml file:

version: '3'
services:
  solr:
    image: solr:9.4
    ports:
     - "8983:8983"
    volumes:
      - 'solr_data:/var/solr'
    command:
      - solr-precreate
      - documents
volumes:
  solr_data:
    driver: local

Add a vector to the collection

I add a single vector [0.57735027, 0.57735027, 0.57735027] (a unit vector).

# Create a 3D vector type 
curl  -X POST \
  'http://localhost:8983/api/cores/documents/schema' \
  --header 'Content-Type: application/json' \
  --data-raw '{
  "add-field-type": {
    "name": "3D-vector",
    "class": "solr.DenseVectorField",
    "vectorDimension": "3",
    "vectorEncoding": "FLOAT32",
    "similarityFunction": "dot_product"
  }
}'

# Add a field "vector" in the collection
curl  -X POST \
  'http://localhost:8983/api/cores/documents/schema' \
  --header 'Content-Type: application/json' \
  --data-raw '{
  "add-field": [
    {
      "name": "vector",
      "type": "3D-vector"
    }
  ]
}'

# Add a single vector (normalized) into the collection "documents"
curl  -X POST \
  'http://localhost:8983/api/cores/documents/update?commit=true' \
  --header 'Content-Type: application/json' \
  --data-raw '[
  {
    "vector": [
      0.57735027,
      0.57735027,
      0.57735027
    ]
  }
]'

Perform a knn search

Now I perform a knn search with the query vector [0.26726124, 0.53452248, 0.80178373].

The corresponding dot product should be 0.92582 (same as cosine similarity since I use normalized vectors).
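As a quick sanity check of the expected value, here is a minimal Python sketch that recomputes the dot product and the norms of the two vectors from this example. The helper names dot and norm are only illustrative and are not part of Solr.

```python
import math

# Vectors taken from the example in this question.
stored = [0.57735027, 0.57735027, 0.57735027]
query = [0.26726124, 0.53452248, 0.80178373]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


expected_dot = dot(stored, query)
cosine = expected_dot / (norm(stored) * norm(query))

# Both norms are 1 up to rounding, so the dot product and the
# cosine similarity coincide at about 0.92582.
print(round(norm(stored), 5), round(norm(query), 5))
print(round(expected_dot, 5), round(cosine, 5))
```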

To double-check the returned dot product, I add a computed field that uses the vectorSimilarity function query:
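The request itself is not reproduced here, but its body can be read back from the params echoed in the response. A rough Python sketch of an equivalent request follows. The /solr/documents/select endpoint and the helper names build_request and run_search are assumptions for illustration; only the "query" and "fields" values come from the echoed params.

```python
import json
import urllib.request

# Assumed endpoint: the JSON Request API on the "documents" core.
SOLR_URL = "http://localhost:8983/solr/documents/select"
QUERY_VECTOR = "[0.26726124, 0.53452248, 0.80178373]"

# Body reconstructed from the "json" param echoed in the response.
payload = {
    "fields": [
        "vector",
        "score",
        "vectorSimilarity(FLOAT32, DOT_PRODUCT, vector, " + QUERY_VECTOR + ")",
    ],
    "query": "{!knn f=vector topK=10}" + QUERY_VECTOR,
}


def build_request(url=SOLR_URL):
    """Build (but do not send) the POST request carrying the JSON body."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(url=SOLR_URL):
    """Send the request to a running Solr and return the parsed response."""
    with urllib.request.urlopen(build_request(url)) as resp:
        return json.load(resp)
```

Calling run_search() requires the local Solr from the docker-compose file above to be running.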

Response:

{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "json": "{\n  \"fields\": [\n    \"vector\",\n    \"score\",\n    \"vectorSimilarity(FLOAT32, DOT_PRODUCT, vector, [0.26726124, 0.53452248, 0.80178373])\"\n  ],\n  \"query\": \"{!knn f=vector topK=10}[0.26726124, 0.53452248, 0.80178373]\"\n}"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "maxScore": 0.96291006,
    "numFoundExact": true,
    "docs": [
      {
        "vector": [
          0.57735026,
          0.57735026,
          0.57735026
        ],
        "score": 0.96291006,
        "vectorSimilarity(FLOAT32, DOT_PRODUCT, vector, [0.26726124, 0.53452248, 0.80178373])": 0.96291006
      }
    ]
  }
}

As we can see, the returned value for the dot product is 0.96291006, which differs significantly from 0.92582.

The weirdest thing is that if I use the streaming expression endpoint with the expression dotProduct(array(0.57735027,0.57735027,0.57735027),array(0.26726124,0.53452248,0.80178373)), Solr computes the right dot product:

curl  -X GET \
  'http://localhost:8983/solr/documents/stream?expr=dotProduct(array(0.57735027%2C0.57735027%2C0.57735027)%2Carray(0.26726124%2C0.53452248%2C0.80178373))' \
  --header 'Content-Type: application/json'

Response:

{
  "result-set": {
    "docs": [
      {
        "return-value": 0.9258201002207116
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 15
      }
    ]
  }
}

Solution

  • I finally understood why the scores seem incorrect, thanks to this issue.

    It appears that Solr returns a normalized similarity, (1 + cosine_sim) / 2, rather than the raw cosine similarity (which equals the dot product for unit vectors). This explains the gap between the value I computed and the one returned by the knn search.

    To recover the cosine similarity, apply the formula 2 * normalized_cosine_sim - 1.

    For the example in my question, 2 * 0.96291006 - 1 gives 0.92582.
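The conversion can be written as a pair of small helpers. This is a minimal Python sketch of the formula above, assuming unit-length vectors as in the question. The function names are hypothetical and not part of any Solr client.

```python
def knn_score_to_dot_product(score):
    """Invert the normalization: raw = 2 * score - 1."""
    return 2.0 * score - 1.0


def dot_product_to_knn_score(dot_product):
    """Apply the normalization: score = (1 + raw) / 2."""
    return (1.0 + dot_product) / 2.0


# Score returned by the knn search in the question.
print(round(knn_score_to_dot_product(0.96291006), 5))  # 0.92582
```

Applied to the response above, the same helper can be used on each document's "score" field to recover the raw similarity on the client side.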