php mysql ranking bayesian rating-system

Adding an extra factor (number of clicks) to a Bayesian ranking system


I run a music website for amateur musicians where we have a rating system based on a score out of 10, which is then calculated into an overall score out of 100. We have a "credibility" points system for users which directly influences the average score at the point of rating, but the next step is to implement a chart system which uses this data effectively.

I'll try and explain exactly how it all works so you can see which data I have at my disposal.

  • A site member rates a track between 1 and 10.
  • That site member has a "credibility" score, which is just a total of points accumulated for various activities around the site. A user gains, for example, 100 points for giving a rating so the more ratings they give, the higher their "credibility" score. Only the total credibility score is saved in the database, updated each time a user performs an activity with a points reward attached. These individual activities are not stored.
  • Based on the credibility of this user compared to other users who have rated the track, a weighted average is calculated for the track, which is then stored as a number between 1 and 100 in the tracks table.
  • In the tracks table, the number of times a track is listened to (i.e. number of plays) is also stored as a total.

So the data I have to work with is:

  • Overall rating for the track (number between 1 and 100)
  • Number of ratings for the track
  • Number of plays for the track

In the chart system I want to create a ranking that uses the above 3 sets of data to create a fair balance between quality (overall rating, normalized with number of ratings) and popularity (number of plays). BUT the system should factor quality more heavily than popularity, so for example the quality aspect makes up 75% of the normalized ranking and popularity 25%.

After a search on this site I found the IMDB Bayesian-style system which is helpful for working out the quality aspect, but how do I add in the popularity (number of plays) and have it balanced in the way I want?

The site is written in PHP and MySQL if that helps.

EDIT: the title says "number of clicks" but this is basically the direct equivalent of "number of plays".


Solution

  • You may want to try the following. The IMDB equation you mentioned uses weighting to lean toward either the average rating of the movie or the average rating of all movies:

    WR = (v/(v+m)) × R + (m/(v+m)) × C 
    

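    Here R is the average rating of the movie (in your case, the track), v is its number of votes, m is the minimum number of votes required to be listed, and C is the mean rating across all movies. So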

    v << m => v/(v+m) -> 0; m/(v+m) -> 1 => WR -> C
    

    and

    v >> m => v/(v+m) -> 1; m/(v+m) -> 0 => WR -> R
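
As a rough illustration of the two limits above, here is a minimal Python sketch of the weighted rating (Python is used only for illustration; the arithmetic ports directly to PHP). The function name `weighted_rating` and the sample values for m and C are assumptions, not anything taken from IMDB or from your site.

```python
def weighted_rating(R, v, m, C):
    """Bayesian-style weighted rating: WR = (v/(v+m))*R + (m/(v+m))*C."""
    if v + m == 0:
        return C  # no ratings and no prior weight: fall back to the overall mean
    return (v / (v + m)) * R + (m / (v + m)) * C


# Hypothetical values: overall mean C = 60, prior weight m = 10 ratings.
few = weighted_rating(R=90, v=2, m=10, C=60)     # v << m: pulled toward C
many = weighted_rating(R=90, v=500, m=10, C=60)  # v >> m: stays close to R
print(round(few, 2), round(many, 2))
```

With only 2 ratings the track is pulled most of the way toward the overall mean, while with 500 ratings it keeps almost all of its own average.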
    

    This should generally be fair. Calculating a popularity score between 0 and 100 based on the number of plays is pretty tricky unless you really know your data. As a first try, calculate the average number of plays avg(p) and the standard deviation std(p) = sqrt(var(p)). You can then use these to scale the number of plays using a technique called whitening (standardization):

    WHITE(P) = (p - avg(p))/std(p)
    

    If your data looks like a bell curve, roughly two thirds of the tracks will get a score between -1 and 1, but the value is not strictly bounded. You can then map this to the range 0 - 100 by scaling again and clamping anything that falls outside:

    POP = min(100, max(0, 50 * (1 + WHITE(P))))
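
The popularity scaling can be sketched as follows. This is only a sketch under stated assumptions: it divides by the population standard deviation (the square root of the variance) and clamps the result into 0 to 100, since a standardized score is not guaranteed to stay within -1 and 1. The function name `popularity_scores` and the sample play counts are made up for illustration.

```python
from statistics import mean, pstdev


def popularity_scores(plays):
    """Map raw play counts to 0-100 scores via standardization and clamping."""
    avg = mean(plays)
    std = pstdev(plays)  # population standard deviation, i.e. sqrt(var(p))
    scores = []
    for p in plays:
        # If every track has the same play count, treat them all as average.
        white = (p - avg) / std if std > 0 else 0.0
        pop = 50 * (1 + white)
        scores.append(min(100.0, max(0.0, pop)))  # clamp into 0-100
    return scores


# Hypothetical play counts for five tracks, one of them a runaway hit.
scores = popularity_scores([10, 20, 30, 40, 1000])
```

Note how a single very popular track drags the average up and pushes the rest below 50. If your play counts are skewed like this, the bell-curve assumption is a poor fit and you may want to experiment with a different scaling.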
    

    To combine the two scores using a weighting factor w (e.g. 0.75), you'd simply do:

    RATING = w × WR + (1 - w) × POP
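
A small sketch of the combination step, assuming both WR and POP are already on the same 0 to 100 scale. The function name `chart_score` and the three sample tracks are hypothetical.

```python
def chart_score(wr, pop, w=0.75):
    """Blend quality (wr) and popularity (pop), both on a 0-100 scale."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be between 0 and 1")
    return w * wr + (1 - w) * pop


# Hypothetical tracks as (name, WR, POP) tuples.
tracks = [("A", 80.0, 20.0), ("B", 60.0, 100.0), ("C", 75.0, 50.0)]

# Highest combined score first.
chart = sorted(tracks, key=lambda t: chart_score(t[1], t[2]), reverse=True)
```

With w = 0.75 a very popular but lower-rated track can still edge ahead of a better-rated but rarely played one, so it is worth trying a few values of w against real data.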
    

    Play with these and let me know how you get on.

    NOTE: this does not account for the fact that a user can "game" the popularity by playing a track many times. You could get around this by penalising multiple plays of a single song:

    deltaP = 1 - (Puser - 1)/TPuser

    where:

    • deltaP = change in # plays
    • Puser = number of times this user has played this track
    • TPuser = total number of plays (not unique tracks) by the user

    So the more times a user plays just the one track, the less each play counts toward the total number of plays for that track. If the user's listening habits are diverse then TPuser will be large and so deltaP will tend back to 1. This can still be gamed but is a good start.
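
To see how the penalty behaves, here is a minimal sketch. It assumes that both Puser and TPuser already include the play being counted, which the formula itself does not specify, and the function name `play_increment` is made up for illustration.

```python
def play_increment(p_user, tp_user):
    """deltaP = 1 - (Puser - 1) / TPuser, counting the current play in both."""
    if p_user < 1 or tp_user < p_user:
        raise ValueError("expected 1 <= p_user <= tp_user")
    return 1 - (p_user - 1) / tp_user


# A user who only ever replays one track: each extra play is worth less.
single_track = [play_increment(n, n) for n in (1, 2, 10)]

# A user with diverse habits: 2nd play of this track out of 200 plays overall.
diverse = play_increment(2, 200)
```

The first play always counts in full, repeat plays by a single-track listener shrink quickly, and a repeat play by someone who listens widely still counts as nearly one play.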