Looking at the Elasticsearch docs for the dense_vector field type, there is a dot_product scoring option, which is defined as:
0.5 + (dot_product(query, vector) / (32768 * dims))
My question is, why is it not just dot_product(query, vector)? Where does the 32768 come from?
Any help appreciated!
Although it appears in the Elasticsearch documentation, this formula comes from Lucene, which Elasticsearch depends on.
As to why this function is used to compare vectors: the dot product is the sum of the products of the corresponding components of two vectors, so its value can be huge. Since we want to compare vectors, we want a score, and a score with boundaries.
"Dot product score computed over signed bytes, scaled to be in [0, 1]." (Source)
This kind of vector should only contain signed bytes, ranging over [-128, 127]. Each component product therefore has a magnitude of at most 128^2 = 16384, so the dot product has a magnitude of at most 16384 * dims.
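As a quick sanity check of that bound, here is a minimal sketch that enumerates every product of two signed bytes. The names BYTE_MIN, BYTE_MAX and dot_product_bound are made up for illustration; they are not part of Elasticsearch or Lucene.

```python
# Signed byte range, as stated in the answer: [-128, 127].
BYTE_MIN, BYTE_MAX = -128, 127

# Every possible product of two signed-byte components.
products = [
    a * b
    for a in range(BYTE_MIN, BYTE_MAX + 1)
    for b in range(BYTE_MIN, BYTE_MAX + 1)
]

largest = max(products)   # reached by (-128) * (-128)
smallest = min(products)  # reached by (-128) * 127


def dot_product_bound(dims):
    """Upper bound on |dot_product| for two byte vectors of length dims."""
    return largest * dims


print(largest, smallest, dot_product_bound(3))
```

Note that the most negative product is slightly smaller in magnitude than 16384, so 16384 * dims is a safe bound on both sides.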
In order to scale it into [0, 1], you would need to compute:
dot_product(query, vector) / (16384 * dims)
Except that dot_product(query, vector) may be negative, so at this point you are actually scaling into [-1, 1].
How about scaling the dot product into [-0.5, 0.5] and then adding 0.5? You end up with the following:
0.5 + dot_product(query, vector) / (2 * 16384 * dims)
Oh snap! What do we find? 2 * 16384 = 32768
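To see the scaling in action, here is a small sketch of the formula in plain Python. It is only an illustration of the arithmetic, not Lucene's actual code, and the function names dot_product and byte_dot_product_score are hypothetical.

```python
def dot_product(query, vector):
    """Plain dot product of two equal-length sequences of signed bytes."""
    return sum(q * v for q, v in zip(query, vector))


def byte_dot_product_score(query, vector):
    """0.5 + dot_product / (32768 * dims), the formula from the question."""
    dims = len(query)
    return 0.5 + dot_product(query, vector) / (32768 * dims)


dims = 4

# Largest possible dot product: every component is (-128) * (-128).
high = byte_dot_product_score([-128] * dims, [-128] * dims)

# Most negative possible dot product: every component is (-128) * 127.
low = byte_dot_product_score([-128] * dims, [127] * dims)

# A zero dot product lands in the middle of the range.
mid = byte_dot_product_score([0] * dims, [5] * dims)

print(high, low, mid)
```

The extremes stay inside [0, 1]: the best case hits exactly 1.0, and the worst case stays just above 0 because (-128) * 127 is slightly smaller in magnitude than 128^2.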