Search code examples
postgresqlvectoreuclidean-distancecosine-similaritypostgresql-11

How to get cosine distance between two vectors in postgres?


I am wondering if there is a way to get cosine distance of two vectors in postgres. For storing vectors I am using CUBE data type.

Below is my table definition:

test=# \d vectors                                                                                                                                
                            Table "public.vectors"
 Column |  Type   | Collation | Nullable |               Default               
--------+---------+-----------+----------+-------------------------------------
 id     | integer |           | not null | nextval('vectors_id_seq'::regclass)
 vector | cube    |           |          | 

Also, sample data is given below:

test=# select * from vectors order by id desc limit 2;
   id    |                  vector                  
---------+------------------------------------------
 2000000 | (109, 568, 787, 938, 948, 126, 271, 499)
 1999999 | (139, 365, 222, 653, 313, 103, 215, 796)

I actually can write my own PLPGSql function for this, but wanted to avoid this as it might not be efficient.


Solution

  • About your table

    First of all, I believe you should change your data type to plain array.

    CREATE TABLE public.vector ( 
      id serial NOT NULL,
      vctor double precision [3] --for three dimensional vectors; of course you can change the dimension or leave it unbounded if you need it.
     );
    
    INSERT INTO public.vector (vctor) VALUES (ARRAY[2,3,4]);
    INSERT INTO public.vector (vctor) VALUES (ARRAY[3,4,5]);
    

    So

    SELECT * FROM public.vector;
    

    Will result in the following data

       id |   vctor
    ------|---------
        1 | {2,3,4}
        2 | {3,4,5}
    

    Maybe not the answer you expected but consider this

    As you may know already, calculating the cosine between the vectors involves calculating the magnitudes. I don't think the problem is the algorithm but the implementation; it requires calculating squares and square roots that is expensive for a RDBMS.

    Now, talking about efficiency; the server process does not take the load when calling mathematical functions. In PostgreSQL, the mathematical functions (look here) run from the C library so they are pretty efficient. However, in the end, the host has to assign some resources to make these calculations.

    I would indeed think carefully before implementing these rather costly operations inside the server. But there is not a right answer; it depends on how you are using the database. For example if it is a production database with thousands of concurrent users, I would move this kind of calculation elsewhere (a middle layer or a user application.) But if there are few users and your database is for a small research operation, then it is fine to implement it as a stored procedure or a process running inside your server but keep in mind this will affect scalability or portability. Of course, there are more considerations like how many rows will be processed, or whether or not you intend to fire triggers, etc.

    Consider other alternatives

    Make a client app

    You can do a fast and decent program in VB or the language of your choice. And let the client app make the heavy calculation and use the database for what it does best that is storing and retrieving data.

    Store the data differently

    For this particular example, you could store the unit vectors plus the magnitude. In this way, finding the cosine between any two vectors reduces simply to the dot product of the unit vectors (only multiplication and division and no squares nor square roots.)

    CREATE TABLE public.vector ( 
         id serial NOT NULL,
         uvctor double precision [3], --for three dimensional vectors; of course you can change the dimension or make it decimal if you need it
         magnitude double precision
     ); 
    
    INSERT INTO public.vector (vctor) VALUES (ARRAY[0.3714, 0.5571, 0.7428], 5.385); -- {Ux, Uy, Uz}, ||V|| where V = [2, 3, 4];
    INSERT INTO public.vector (vctor) VALUES (ARRAY[0.4243, 0.5657, 0.7071], 7.071); -- {Ux, Uy, Uz}, ||V|| where V = [3, 4, 5];
    
    SELECT a.vctor as a, b.vctor as b, 1-(a.uvctor[1] * b.uvctor[1] + a.uvctor[2] * b.uvctor[2] + a.uvctor[3] * b.uvctor[3]) as cosine_distance FROM public.vector a
    JOIN public.vector b ON a.id != b.id;
    

    Resulting in

                              a  |                           b  | cosine_distance
    -----------------------------|------------------------------|------------------
    {0.3714,0.5571,0.7428,5.385} | {0.4243,0.5657,0.7071,7.071} |      0.00202963
    {0.4243,0.5657,0.7071,7.071} | {0.3714,0.5571,0.7428,5.385} |      0.00202963
    

    Even if you have to calculate the magnitude of the vector inside the server, you will make it once per vector and not every time you need to get the distance between two of them. This becomes more important as the number of rows is increasing. For 1000 vectors for example, you would have to calculate the magnitude 999000 times if you wanted to obtain the cosine difference between any two vectors using the original vector components.

    Any combination of the above

    Conclusion

    When we pursue efficiency, most of the times there is not a canonical answer. Instead we have trade-offs that we have to consider and evaluate. It always depends on the ultimate goal we need to achieve. Databases are excellent for storing and retrieving data; they can definitely make other things but that comes with an added cost. If we can live with the added overhead then it's fine; otherwise we have to consider alternatives.