python elasticsearch nlp

What is the right way to get a unit vector for indexing into Elasticsearch ANN with dot_product?


I am trying to index word embedding vectors into an Elasticsearch 8 dense_vector field for ANN search with dot_product similarity.

I can successfully index the vectors with cosine similarity, so for dot_product I converted each one to a unit vector with NumPy:

    import numpy as np
    unit_vector = vec / np.linalg.norm(vec)

but I get a 400 error like this:

The [dot_product] similarity can only be used with unit-length vectors. Preview of invalid vector: [-0.0038341882, -0.1564709, 0.08771773, -0.14555556, -0.07952896, ...]

Am I missing something?


Solution

  • I ran into exactly the same problem and found a solution after much experimentation.

    In my case, when indexing many embeddings into Elasticsearch (a dense_vector field with the similarity parameter set to dot_product), most were indexed properly, but a small percentage failed with: The [dot_product] similarity can only be used with unit-length vectors.

    After intensive testing, I found that the cause was the dtype of my unit vectors: they were np.float16. Switching my workflow to np.float32 for the unit vectors solved the issue.
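
A minimal sketch of the fix described above: normalize in np.float32 and check the result before sending it to Elasticsearch. The helper names to_unit_vector and is_unit_length are hypothetical, and the tolerance of 1e-4 on the squared norm is an assumption about how strict the server-side check is, not a documented value, so adjust it as needed. np.float16 keeps only about three decimal digits of precision, which may be why a fraction of vectors ends up slightly off unit length.

```python
import numpy as np


def to_unit_vector(vec):
    """Return vec scaled to unit length, computed and stored as float32."""
    v = np.asarray(vec, dtype=np.float32)  # upcast float16 input before normalizing
    norm = np.linalg.norm(v)
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return v / norm


def is_unit_length(vec, tol=1e-4):
    """Check |v.v - 1| <= tol in float64. tol is an assumed tolerance."""
    v = np.asarray(vec, dtype=np.float64)
    return abs(float(np.dot(v, v)) - 1.0) <= tol


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vec = rng.normal(size=300)           # stand-in for a word embedding
    unit = to_unit_vector(vec)
    print(unit.dtype, is_unit_length(unit))
    # Convert to a plain list when building the document body for indexing.
    doc_vector = unit.tolist()
```

Running is_unit_length on vectors just before indexing, in whatever dtype they are actually stored in, should make it easy to spot the ones that would be rejected.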