Search code examples
algorithmsimilaritycosine-similarity

How to calculate similarity between two arrays?


I have two arrays like these: a1=[8,6,4,8,8,9] , a2=[7,3,8,4,3,9,9,5,8,3,5,8] . They may contain different number of integers that can be repeated. How can I calculate similarity between these? Which metric should I use?

UPDATE: numbers represent vote from 1 to 10. Users vote items in one specific category. We don't know items they have been voted. We only know that these items are belong to the same category. So we have arrays of votes in category. The problem is how we could calculate similarity between these users.


Solution

  • Treat your votes as random samples

    If you really don't know what they voted for, the only thing you have is the distribution of votes. I.e., you have two samples from two distributions and you need to evaluate the difference between the distributions.

    The simplest approach would be to count the number of times the user gave a given vote, i.e., convert [8,6,4,8,8,9] to [0,0,0,1,0,1,0,3,9,0] (i.e., 3 votes of 8 means 3 in the 8th position). Now your vectors have the same length and you can use cosine similarity.

    Fuzzy your data

    If you do not have much data, i.e., you really need to compare people who voted 1-2 times, you can try "fuzzying the votes", i.e., treating a vote for 8 as 1/2 vote for 8, and 1/4 votes for each of 7 and 9. E.g., your vectors [0,0,0,0,0,0,0,0,0,1] and [0,0,0,0,0,0,0,0,1,0] will become [0,0,0,0,0,0,0,0,0.33,0.66] and [0,0,0,0,0,0,0,0.25,0.5,0.25].

    This is equivalent to using an "unusual" dot product: instead of the simple (v,u)=sum_i(v_i*u_i*), use (Av,u)=sum_ij(a_ij*v_i*u_j) where A is an almost diagonal matrix (e.g., a_ii=4, a_ij=1 if |i-j|=1, a_ij=0 otherwise). Then the new cosine similarity is then defined as

    CS(u,v)=arccos( (Av,u) / sqrt( (Av,v) * (Au,u) ) )
    

    For the tridiagonal example above, the formula looks like:

    (Av,u) = 4*sum(v_i,u_i) + sum(v_i,u_{i-1}) + (v_i,u_{i+1})
    

    Use Stats

    If you do have a lot of data, i.e., every person has votes every number at least 5 times (i.e., each length 10 vector has all components >=5), then you can use Chi-squared test or, better yet, Likelihood-ratio test.

    recommender system

    You should specify which coordinates match (if you, as I suspect, doing a recommender system). E.g., if user 1 votes are [3] and user 2 votes are [4,5], you need to know whether score 3 is for the same object as score 4 or 5 or for an entirely different object.