Tags: cosine-similarity, weaviate, approximate-nn-searching

Weaviate - top hits for with_near_vector() don't include the record whose vector perfectly matches the query vector


I have a very large Weaviate vector storage class (700,000 records) in which I pass my own custom vectors. I’m trying to get distances against a vector I pass as below. The vector is actually a match to one of the records, so I know the top hit should be the record with the identical vector (distance very close to 0). However, when I ask for top hits, the “closest” record is returning a distance of around 0.10, and this record is definitely not the record that matches my query vector perfectly (node_type="type1" instead of "type2").

# NOTE: mean_emb is a numpy array that matches a record pushed to the MyClass weaviate class.
# This theoretically should return distances from all 700k records to specified vector, since "distance" = 1.0, but I get why it wouldn't computationally
result = (client.query.get("MyClass", ["message", "node_type", "my_id", "timestamp"])
          .with_near_vector({"vector": mean_emb.tolist(), "distance": 1.0})
          .with_additional(["vector", "distance"]).do())
result = result["data"]["Get"]["MyClass"]
print(len(result))  # only 11,100 distances are returned

It looks like with_offset() doesn’t like it when the offset is >100,000.

I have tried paginating with with_after(), but it doesn’t support queries that use with_near_vector(). I have also tried with_offset() + with_limit(), but this is terribly slow. Is there a workaround, or am I doing something wrong? How can I query my class so that my top N results include the true record match (distance close to 0)?

To prove that there is in fact a record with distance ~0.000, here’s a query that surfaces the record matching the vector:

where_filter = {"path": ["node_type"], "operator": "Equal", "valueText": "type2"}
result = (client.query.get("MyClass", ["message", "node_type", "my_id", "timestamp"])
          .with_near_vector({"vector": mean_emb.tolist()})
          .with_additional(["distance", "id"]).with_where(where_filter).do())
print(result)

Gives me this (I’ve changed the values of some of the record meta-data to protect data):

{'data': {'Get': {'MyClass': [{'_additional': {'distance': -1.9073486e-06,
      'id': 'fdb00f95-2c07-462c-84cd-9380c6777801'},
     'my_id': 'Record that matches the vector passed',
     'message': None,
     'node_type': 'type2',
     'timestamp': None},
    {'_additional': {'distance': 0.6122676,
      'id': '0deb152a-eef0-485c-ad6e-c9e29f9a3915'},
     'my_id': "Another type2 record that doesn't match vector passed",
     'message': None,
     'node_type': 'type2',
     'timestamp': None}]}}}

Solution

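  • Near-vector search in Weaviate uses an approximate index (HNSW), so the true nearest neighbor can be missed. If you need the exact match to appear at the top, tune the index for higher recall: try increasing the value of the ef parameter. A larger ef makes each query examine more candidates, which improves recall at the cost of slower queries.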
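
As a rough sketch of how ef might be raised on an existing class: the helper below builds a partial class config that changes only ef and sends it through the client's schema API. It assumes the v3 Python client used in the question (a weaviate.Client whose schema object exposes update_config(class_name, config)). The helper names build_ef_update and raise_ef are hypothetical, and the value 512 in the usage note is only an illustration, not a recommendation.

```python
from typing import Any, Dict


def build_ef_update(ef: int) -> Dict[str, Any]:
    """Return a partial class config that changes only the HNSW `ef` value."""
    # In Weaviate, ef = -1 means "dynamic ef" (the server picks a value);
    # any other value is expected to be a positive integer.
    if ef != -1 and ef < 1:
        raise ValueError("ef must be a positive integer, or -1 for dynamic ef")
    return {"vectorIndexConfig": {"ef": ef}}


def raise_ef(client: Any, class_name: str, ef: int) -> Dict[str, Any]:
    """Apply a new `ef` to an existing class and return the payload that was sent."""
    payload = build_ef_update(ef)
    # Assumption: `client` is a weaviate-client v3 `weaviate.Client`, so
    # `client.schema.update_config(class_name, config)` is available.
    client.schema.update_config(class_name, payload)
    return payload
```

With a connected client this could be called as raise_ef(client, "MyClass", 512), after which the original with_near_vector() query can be rerun to check whether the type2 record now appears first. If it still does not, a larger ef is worth trying before concluding that the data is at fault.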