elasticsearch vector dot-product

Why does Elasticsearch use a rather odd `dot_product` scoring formula?


Looking at the Elasticsearch docs for the dense_vector field type, there is a dot_product scoring option which is defined as:

0.5 + (dot_product(query, vector) / (32768 * dims))

My question is, why is it not just dot_product(query, vector)? Where does the 32768 come from?

Any help appreciated!


Solution

  • TL;DR

    Although the formula appears in the Elasticsearch documentation, it comes from Lucene, which Elasticsearch is built on.

    As for why this particular function is used to score vectors:

    My Interpretation

    The dot product is the sum of the products of the corresponding components of two vectors, so its value can be very large. To compare vectors, Lucene wants a score with fixed bounds:

    Dot product score computed over signed bytes, scaled to be in [0, 1]. Source

    This kind of vector holds only signed bytes, in the range [-128, 127].

    Each component product therefore has a magnitude of at most 128^2 = 16384, so the dot product has a magnitude of at most 16384 * dims.

    To bring that magnitude down to at most 1, you would compute:

    dot_product(query, vector) / (16384 * dims)

    Except that dot_product(query, vector) may be negative.

    So at this point the result lies in [-1, 1], not [0, 1].

    Instead, scale the dot product to [-0.5, 0.5] by dividing by twice as much, then add 0.5 to shift it into [0, 1].

    So you end up with the following:

    0.5 + dot_product(query, vector) / (2 * 16384 * dims).

    Oh snap! What do we find ^^:

    2 * 16384 = 32768
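
To sanity-check the interpretation above, here is a minimal Python sketch of the formula. The function name `byte_dot_product_score` is made up for illustration and is not part of the Lucene or Elasticsearch API. It only assumes what the answer states: components are signed bytes and the score is 0.5 + dot_product / (32768 * dims).

```python
def byte_dot_product_score(query, vector):
    """Illustrative only: 0.5 + dot_product / (32768 * dims) for signed-byte vectors."""
    if not query or len(query) != len(vector):
        raise ValueError("vectors must be non-empty and have the same number of dimensions")
    for component in (*query, *vector):
        if not -128 <= component <= 127:
            raise ValueError("components must fit in a signed byte")
    dims = len(query)
    dot = sum(q * v for q, v in zip(query, vector))
    # |dot| <= 16384 * dims, so dot / (2 * 16384 * dims) lies in [-0.5, 0.5]
    return 0.5 + dot / (32768 * dims)


dims = 4
highest = byte_dot_product_score([-128] * dims, [-128] * dims)  # each product is 16384
lowest = byte_dot_product_score([-128] * dims, [127] * dims)    # each product is -16256
orthogonal = byte_dot_product_score([127, 0, 0, 0], [0, 127, 0, 0])  # dot product is 0
print(highest, lowest, orthogonal)  # 1.0 0.00390625 0.5
```

The lowest reachable score is 1/256 rather than exactly 0 because the signed-byte range is asymmetric: the most negative product is -128 * 127 = -16256, not -16384.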