Tags: cassandra, nosql, datastax-astra, vector-search

How are the results of a Cassandra Vector Search sorted?


I have a table of movies in Cassandra (hosted on Astra DB), with movie_id as its only primary key column. There are several columns, but for my vector search I really only care about the title. The movie_vector column has a storage-attached index (SAI) on it, which was created with the following CQL:

CREATE CUSTOM INDEX ON movieapp.movies (movie_vector) USING 'StorageAttachedIndex';

When I execute a CQL vector search based on the vector defined for "Star Wars," I get these results:

SELECT title, movie_vector FROM movies
ORDER BY movie_vector ANN OF [37, 4, 8, 13, 42.1497, 8.1, 6778]
LIMIT 6;

 title                   | movie_vector
-------------------------+-------------------------------------
               Star Wars |  [37, 4, 8, 13, 42.1497, 8.1, 6778]
 The Empire Strikes Back | [37, 4, 8, 13, 19.47096, 8.2, 5998]
      Return of the Jedi | [37, 4, 8, 13, 14.58609, 7.9, 4763]
           The Lion King |    [49, 1, 3, 7, 21.60576, 8, 5520]
              Pocahontas |  [10, 1, 3, 4, 13.28007, 6.7, 1509]
                  Batman |    [18, 5, 8, 0, 19.10673, 7, 2145]

(6 rows)

How are these results sorted? Is there some way to see the logic behind that?


Solution

  • Given the defaults and the index shown above, the results of a CQL vector search are sorted by their cosine similarity to the query vector, most similar first. You can see this with the CQL similarity_cosine function, which takes two arguments: a column of type Vector<float, n> and the query vector itself.

    For the above query, it would work like this:

    SELECT title,
        similarity_cosine(movie_vector, [37, 4, 8, 13, 42.1497, 8.1, 6778]) AS similarity,
        movie_vector
    FROM movies
    ORDER BY movie_vector ANN OF [37, 4, 8, 13, 42.1497, 8.1, 6778]
    LIMIT 6;
    
     title                   | similarity | movie_vector
    -------------------------+------------+-------------------------------------
                   Star Wars |          1 |  [37, 4, 8, 13, 42.1497, 8.1, 6778]
     The Empire Strikes Back |   0.999998 | [37, 4, 8, 13, 19.47096, 8.2, 5998]
          Return of the Jedi |   0.999996 | [37, 4, 8, 13, 14.58609, 7.9, 4763]
               The Lion King |   0.999995 |    [49, 1, 3, 7, 21.60576, 8, 5520]
                  Pocahontas |   0.999995 |  [10, 1, 3, 4, 13.28007, 6.7, 1509]
                      Batman |   0.999992 |    [18, 5, 8, 0, 19.10673, 7, 2145]
    
    (6 rows)
    

    As shown above, the vector for the movie "Star Wars" has a similarity of 1, which is an exact match. This makes sense, as that was the vector ([37, 4, 8, 13, 42.1497, 8.1, 6778]) used in the query.

    The remaining rows are ordered by descending similarity_cosine score, which measures how closely the direction of each movie_vector matches that of the query vector. The rows most similar to the query vector appear at the top of the result set, and the least similar appear at the bottom.

    It's a bit verbose, but still a useful way to show how vector search results are sorted.
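
To check the ordering outside the database, here is a minimal Python sketch that recomputes the cosine of the angle between each movie_vector and the query vector, using the six vectors from the output above. The helper name cosine and the movies dictionary are illustrative and not part of any Cassandra API. It computes the raw cosine, and the figures CQL prints may be scaled differently (the values above look consistent with (1 + cos) / 2), so compare the ordering rather than the exact digits.

```python
import math

def cosine(a, b):
    """Raw cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Query vector: the movie_vector stored for "Star Wars".
query = [37, 4, 8, 13, 42.1497, 8.1, 6778]

# Vectors copied from the result set shown in the question.
movies = {
    "Batman": [18, 5, 8, 0, 19.10673, 7, 2145],
    "Pocahontas": [10, 1, 3, 4, 13.28007, 6.7, 1509],
    "The Lion King": [49, 1, 3, 7, 21.60576, 8, 5520],
    "Return of the Jedi": [37, 4, 8, 13, 14.58609, 7.9, 4763],
    "The Empire Strikes Back": [37, 4, 8, 13, 19.47096, 8.2, 5998],
    "Star Wars": [37, 4, 8, 13, 42.1497, 8.1, 6778],
}

# Sort titles by descending cosine, i.e. most similar to the query first.
ranked = sorted(movies, key=lambda t: cosine(movies[t], query), reverse=True)

for title in ranked:
    print(f"{title:<24} {cosine(movies[title], query):.6f}")
```

Sorting by this value in descending order reproduces the row order returned by the ANN query. All of the scores sit very close to 1 because the last component of each vector is far larger than the others and dominates the direction of every vector.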