How to perform exact nearest neighbors search in Vespa?


I have the following schema:

schema embeddings {
  document embeddings {
    field id type int {}
    field text_embedding type tensor<double>(d0[960]) {
      indexing: attribute | index
      attribute {
        distance-metric: euclidean
      }
    }
  }

  rank-profile distance {
    num-threads-per-search:1
    inputs {
      query(query_embedding) tensor<double>(d0[960])
    }
    first-phase {
      expression: distance(field, text_embedding)
    }
  }
}

and the following query body:

body = {
    'yql': 'select * from embeddings where ({approximate:false, targetHits:10} nearestNeighbor(text_embedding, query_embedding));',
    "hits":10,
    'input': {
        'query(query_embedding)': [...],
    },
    'ranking': {
        'profile': 'distance',
    },
}

The problem is that this query returns different results depending on the targetHits parameter. For example, the top-1 distance with targetHits: 10 is 2.847000, while the top-1 distance with targetHits: 200 is 3.028079.

Moreover, if I run the same query using the Vespa CLI:

vespa query -t http://query "select * from embeddings where ([{\"targetHits\":10}] nearestNeighbor(text_embedding, query_embedding));" \
   "approximate=false" \
   "ranking.profile=distance" \
   "ranking.features.query(query_embedding)=[...]"

I get a third, different result:

{
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {
            "totalCount": 10
        },
        "coverage": {
            "coverage": 100,
            "documents": 1000000,
            "full": true,
            "nodes": 1,
            "results": 1,
            "resultsFull": 1
        },
        "children": [
            {
                "id": "id:embeddings:embeddings::926288",
                "relevance": 0.8158006540357854,
    ...

where, as we can see, the top-1 distance is 0.8158.

So how can I perform an exact (not approximate) nearest neighbor search whose results do not depend on any parameters?


Solution

  • Vespa sorts results by descending relevance score. When you use the distance rank-feature instead of closeness as the relevance score (your first-phase ranking expression), you end up inverting the order, so that more distant (worse) neighbors are ranked higher. As you increase targetHits you get even worse neighbors.

    The correct query syntax for exact search is to set approximate:false:

    select * from embeddings where ({approximate:false, targetHits:10} nearestNeighbor(text_embedding, query_embedding));
    

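    However, your first-phase ranking expression should use closeness(field, text_embedding) instead of distance(field, text_embedding), so that the nearest neighbors get the highest relevance scores.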

    From https://docs.vespa.ai/en/nearest-neighbor-search.html:

    The closeness(field, image_embedding) is a rank-feature calculated by the nearestNeighbor query operator. The closeness(field, tensor) rank feature calculates a score in the range [0, 1], where 0 is infinite distance, and 1 is zero distance. This is convenient because Vespa sorts hits by decreasing relevancy score, and one usually want the closest hits to be ranked highest. The first-phase is part of Vespa’s phased ranking support. In this example the closeness feature is re-used and documents are not re-ordered.
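
To make the ordering issue concrete, here is a minimal Python sketch (not Vespa code). It assumes closeness is computed as 1 / (1 + distance), which is how the Vespa rank feature reference describes it. The document ids and distances are made up for illustration, and rank is a hypothetical helper that only mimics sorting hits by descending relevance.

```python
def closeness(distance: float) -> float:
    """Assumed mapping from distance to closeness: 1 / (1 + distance)."""
    return 1.0 / (1.0 + distance)


def rank(scores: dict) -> list:
    """Mimic Vespa's hit ordering: highest relevance score first."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical document ids and Euclidean distances, for illustration only.
distances = {"doc_a": 0.5, "doc_b": 2.0, "doc_c": 3.5}

# Using the raw distance as relevance puts the farthest document on top.
by_distance = rank(distances)

# Using closeness as relevance puts the nearest document on top.
by_closeness = rank({doc: closeness(d) for doc, d in distances.items()})

print(by_distance)   # ['doc_c', 'doc_b', 'doc_a']
print(by_closeness)  # ['doc_a', 'doc_b', 'doc_c']
```

Sorting by raw distance returns the farthest document first, which matches the behavior described in the question. Sorting by closeness returns the nearest document first.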