Search code examples
sqlcosine-similarity

SQL Computation of Cosine Similarity


Suppose you have a table in a database constructed as follows:

create table data (v int, base int, w_td float);
insert into data values (99,1,4);
insert into data values (99,2,3);
insert into data values (99,3,4);
insert into data values (1234,2,5);
insert into data values (1234,3,2);    
insert into data values (1234,4,3);

To be clear select * from data should output:

v   |base|w_td
--------------
99  |1   |4.0
99  |2   |3.0
99  |3   |4.0
1234|2   |5.0
1234|3   |2.0
1234|4   |3.0

Note that since the vectors are stored in a database, we need only store the non-zero entries. In this example, we only have two vectors $v_{99} = (4,3,4,0)$ and $v_{1234} = (0,5,2,3)$ both in $\mathbb{R}^4$.

The cosine similarity of those vectors should be $\displaystyle \frac{23}{\sqrt{41 \cdot 38}} = 0.5826987807288609$.

How do you compute the cosine similarity using nearly only SQL?

I say nearly because you will need the sqrt function which is not always provided in basic SQL implementations, for example it is not in sqlite3!


Solution

  • with norms as (
        select v,
            sum(w_td * w_td) as w2
        from data
        group by v
    )
    select 
        x.v as ego,y.v as v,nx.w2 as x2, ny.w2 as y2,
        sum(x.w_td * y.w_td) as innerproduct,
        sum(x.w_td * y.w_td) / sqrt(nx.w2 * ny.w2) as cosinesimilarity
    from data as x
    join data as y
        on (x.base=y.base)
    join norms as nx
        on (nx.v=x.v)
    join norms as ny
        on (ny.v=y.v)
    where x.v < y.v
    group by 1,2,3,4
    order by 6 desc
    

    yields

    ego|v   |x2  |y2  |innerproduct|cosinesimilarity
    --------------------------------------------------
    99 |1234|41.0|38.0|23.0        |0.5826987807288609