Search code examples
mathstatisticsrating-systemscoring

unrated versus negative-rated entities with Wilson score -- how to handle?


Having read How Not To Sort By Average Rating I thought I should give it a try.

CREATE FUNCTION `mydb`.`LowerBoundWilson95` (pos FLOAT, neg FLOAT)
RETURNS FLOAT DETERMINISTIC
RETURN
IF(
    pos + neg <= 0,
    0,
    (
        (pos + 1.9208) / (pos + neg)
        -
        1.96 * SQRT(
            (pos * neg) / (pos + neg) + 0.9604
        )
        / (pos + neg)
    )
    /
    (
        1 + 3.8416
        / (pos + neg)
    )
);

Running some tests, I discover that objects with pos=0 and neg>0 have very small, but non-negative scores, whereas an object with pos=neg=0 has a score of zero, ranking lower.

I am of the opinion that an unrated object should be listed above one which has no positive ratings but some negatives.

I reasoned that "the individual ratings are all really expressions of deviation from some baseline, so I'll move the baseline, I'll give every object a 'neutral' initial score," so I came up with this:

CREATE FUNCTION `mydb`.`AdjustedRating` (pos FLOAT, neg FLOAT)
RETURNS FLOAT DETERMINISTIC
RETURN
(
    SELECT `mydb`.`LowerBoundWilson95` (pos+4, neg+4)
);

Here are some sample outputs for AdjustedRating

  \  pos  0       1       2
neg
 0   | 0.215 | 0.188 | 0.168
 1   | 0.266 | 0.235 | 0.212
 2   | 0.312 | 0.280 | 0.235

This is closer to the sort of scores I want and as a numerical hack I guess it's workable, but I can't mathematically justify it

Is there a better way, a "right" way?


Solution

  • The problem arises because this approximation (lower confidence bound) is really meant for identifying the highest rated items of a list. If you were interested in the lowest ranked, you could take the upper confidence bound instead.

    Alternatively, we use Bayesian statistics which is the formalization of exactly the second method you describe. Evan Miller actually had a followup post to this in which he said:

    The solution I proposed previously — using the lower bound of a confidence interval around the mean — is what computer programmers call a hack. It works not because it is a universally optimal solution, but because it roughly corresponds to our intuitive sense of what we'd like to see at the top of a best-rated list: items with the smallest probability of being bad, given the data.

    Bayesian statistics lets us formalize this intuition...

    Using the Bayesian ranking approach, any point that has zero data would fall back to the prior mean (what you refer to as the initial score) and then move away from it as it collects data. This is also the approach used at IMDB to compute their top Movies lists. https://math.stackexchange.com/questions/169032/understanding-the-imdb-weighted-rating-function-for-usage-on-my-own-website

    The specific method you suggest of crediting each object 4 upvotes and 4 downvotes is equivalent to putting a mean of 0.5 with a weight of 8 votes. Given an absence of any other data, this is a reasonable start. Laplace famously argued in the sunrise problem that events should be credited with 1 success and 1 failure. In the item ranking problem, we have a lot more knowledge, so it makes sense to set the prior mean equal to the average ranking. The weight of this prior mean (or how fast you move off it as a function of data, also called the prior variance) can be challenging to set.

    For IMDB's ranking of the Top 250 Movies, they use a mean movie ranking of 7.1 with a weight of 25000 votes, which is equivalent to treating all movies as if they started with 25000 "free" votes with a rating of 7.1