Search code examples
pythonmysqldatabaseinformation-retrievalsimilarity

Similarity Between Users Based On Votes


lets say i have a set of users, a set of songs, and a set of votes on each song:

=========== =========== =======
User        Song        Vote
=========== =========== =======
user1       song1       [score]
user1       song2       [score]
user1       song3       [score]
user2       song1       [score]
user2       song2       [score]
user2       song3       [score]
user3       song1       [score]
user3       song2       [score]
user3       song3       [score]
user-n      song-n      [score]
=========== =========== =======

whats the most efficient way to calculate user similarity based on song-votes? is there a better way than iterating over every user and every vote for every song?


Solution

  • There are two common metrics that can be used to find similarities between users:

    1. Euclidean Distance, that is exactly what you are thinking: imagine a n-dimensional graph that has for each axis a song that is reviewed by two involved users (u1 and *u2) and the value on its axis is the score. You can easily calculate similarity using the formula:

      for every song reviewed by u1 and u2, calculate pow(u1.song.score - u2.song.score, 2) and add all together into sum_of_powers. Similarity coefficient is then given by 1 / 1 + (sqrt(sum_of_powers)).

    2. Pearson Correlation (or correlation coefficient): it's a better approach that finds how much two data sets are related one with another. This approach uses more complex formulas and a little of statistics background, check it here: wiki. You will have a graph for every couple of users, then you plot points according to scores.. for example if aSong has been voted 2 from u1 and 4 from u2 it will plot the point (2,4) (assuming that user1 is x-axis and u2 is y-axis).

    Just to clarify, you use linear regression to find two coefficients A and B, that describe the line that minimizes the distance from all the points of the graph. This line has this formula:y = Ax + B. If two sets are similar points should be near to the main diagonal so A should tend to 1 while B to 0. Don't assume this explaination as complete or as a reference because it lacks soundness and typical math formalism, it just to give you an idea.

    EDIT: like written by others, more complex algorithms to cluster data exist, like k-means but I suggest you to start from easy ones (actually you should need something more difficult just when you realize that results are not enough).