python, algorithm, pearson

What is wrong with the Pearson algorithm from “Programming Collective Intelligence”?


This function is from the book “Programming Collective Intelligence” and is supposed to calculate the Pearson correlation coefficient for p1 and p2, which should be a number between -1 and 1.

If two critics rate items very similarly, the function should return 1, or close to 1.

With real user data I sometimes get weird results. In the following example, the dataset critics2 should return 1, but it returns 0 instead.

Does anyone spot a mistake?

(This is not a duplicate of What is wrong with this python function from “Programming Collective Intelligence”)

from __future__ import division
from math import sqrt

def sim_pearson(prefs,p1,p2):
    # Collect the items rated by both p1 and p2
    si={}
    for item in prefs[p1]:
        if item in prefs[p2]: si[item]=1
    # No items in common: return 0
    if len(si)==0: return 0
    n=len(si)
    # Sums of the ratings
    sum1=sum([prefs[p1][it] for it in si])
    sum2=sum([prefs[p2][it] for it in si])
    # Sums of the squared ratings
    sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
    sum2Sq=sum([pow(prefs[p2][it],2) for it in si])
    # Sum of the products of the paired ratings
    pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])
    # Pearson score: numerator over denominator
    num=pSum-(sum1*sum2/n)
    den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
    if den==0: return 0
    r=num/den
    return r

critics = {
    'user1':{
        'item1': 3,
        'item2': 5,
        'item3': 5,
        },
    'user2':{
        'item1': 4,
        'item2': 5,
        'item3': 5,
        }
}
critics2 = {
    'user1':{
        'item1': 5,
        'item2': 5,
        'item3': 5,
        },
    'user2':{
        'item1': 5,
        'item2': 5,
        'item3': 5,
        }
}
critics3 = {
    'user1':{
        'item1': 1,
        'item2': 3,
        'item3': 5,
        },
    'user2':{
        'item1': 5,
        'item2': 3,
        'item3': 1,
        }
}

print(sim_pearson(critics, 'user1', 'user2'))
# result: 1.0 (expected)
print(sim_pearson(critics2, 'user1', 'user2'))
# result: 0 (unexpected)
print(sim_pearson(critics3, 'user1', 'user2'))
# result: -1 (expected)

Solution

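  • There is nothing wrong with your result. Computing the correlation amounts to fitting a line through three points, one per item, with user1's rating as one coordinate and user2's as the other. In the second case all three points have the same coordinates, (5, 5), so you effectively have a single point. You can't say whether these points correlate or anti-correlate, because an infinite number of lines can be drawn through a single point. The coefficient is undefined in that case: den in your code equals zero, and the function falls back to returning 0.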
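
To see this in numbers, here is a minimal sketch that computes the same terms as sim_pearson but works on two plain lists of ratings. The names pearson_terms and pearson_or_none are hypothetical helpers, not from the book, and returning None for the undefined case is just one possible convention. The sketch assumes Python 3.

```python
from math import sqrt

def pearson_terms(xs, ys):
    """Return (num, ss_x, ss_y) for two equal-length rating lists.

    num is the numerator from sim_pearson; ss_x and ss_y are the two
    factors multiplied under the square root in den.
    """
    n = float(len(xs))
    sum_x, sum_y = sum(xs), sum(ys)
    num = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_y = sum(y * y for y in ys) - sum_y ** 2 / n
    return num, ss_x, ss_y

def pearson_or_none(xs, ys):
    """Hypothetical variant: return None when the coefficient is undefined."""
    num, ss_x, ss_y = pearson_terms(xs, ys)
    # A user whose ratings are all identical has zero spread, so den would be 0.
    # Using <= also absorbs a tiny negative value caused by rounding.
    if ss_x <= 0 or ss_y <= 0:
        return None
    return num / sqrt(ss_x * ss_y)

# critics2: every rating is 5, so all three terms vanish
print(pearson_terms([5, 5, 5], [5, 5, 5]))    # (0.0, 0.0, 0.0)
print(pearson_or_none([5, 5, 5], [5, 5, 5]))  # None

# critics3: perfectly opposite ratings
print(pearson_or_none([1, 3, 5], [5, 3, 1]))  # -1.0
```

Whether you return 0, None, or something else for this case depends on how the caller ranks similar users. The point is only that 0/0 carries no information about similarity, so a value of 1 cannot be derived from the formula itself.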