I'm reading Programming Collective Intelligence and rewriting some of its code in a more Pythonic way, just for the sake of learning.
The first chapter is about recommendation systems. It proposes several similarity measures based on the following dictionary:
critics = {
    'Lisa Rose': {
        'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
        'Just My Luck': 3.0, 'Superman Returns': 3.5,
        'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
    'Gene Seymour': {
        'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
        'Just My Luck': 1.5, 'Superman Returns': 5.0,
        'The Night Listener': 3.0, 'You, Me and Dupree': 3.5},
    'Michael Phillips': {
        'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
        'Superman Returns': 3.5, 'The Night Listener': 4.0},
    'Claudia Puig': {
        'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
        'The Night Listener': 4.5, 'Superman Returns': 4.0,
        'You, Me and Dupree': 2.5},
    'Mick LaSalle': {
        'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
        'Just My Luck': 2.0, 'Superman Returns': 3.0,
        'The Night Listener': 3.0, 'You, Me and Dupree': 2.0},
    'Jack Matthews': {
        'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
        'The Night Listener': 3.0, 'Superman Returns': 5.0,
        'You, Me and Dupree': 3.5},
    'Toby': {
        'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0,
        'Superman Returns': 4.0},
}
Given that unique_pairs is a list of tuples containing every possible pair of people:
unique_pairs = list(itertools.combinations(people, 2))
unique_pairs
[('Michael Phillips', 'Mick LaSalle'),
('Michael Phillips', 'Lisa Rose'),
('Michael Phillips', 'Toby'),
('Michael Phillips', 'Jack Matthews'),
('Michael Phillips', 'Gene Seymour'),
('Michael Phillips', 'Claudia Puig'),
('Mick LaSalle', 'Lisa Rose'),
('Mick LaSalle', 'Toby'),
('Mick LaSalle', 'Jack Matthews'),
('Mick LaSalle', 'Gene Seymour'),
('Mick LaSalle', 'Claudia Puig'),
('Lisa Rose', 'Toby'),
('Lisa Rose', 'Jack Matthews'),
('Lisa Rose', 'Gene Seymour'),
('Lisa Rose', 'Claudia Puig'),
('Toby', 'Jack Matthews'),
('Toby', 'Gene Seymour'),
('Toby', 'Claudia Puig'),
('Jack Matthews', 'Gene Seymour'),
('Jack Matthews', 'Claudia Puig'),
('Gene Seymour', 'Claudia Puig')]
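The definition of people is not shown above. A minimal sketch, assuming it is simply the list of names used as keys in critics (with the full dictionary, people = list(critics) would give the same names). The order of the pairs follows the order of that list, so it may differ from the output shown here.

```python
import itertools

# Assumption: people holds the seven names used as keys in critics.
people = ['Lisa Rose', 'Gene Seymour', 'Michael Phillips', 'Claudia Puig',
          'Mick LaSalle', 'Jack Matthews', 'Toby']

# combinations() yields each unordered pair exactly once.
unique_pairs = list(itertools.combinations(people, 2))

print(len(unique_pairs))  # 21
print(unique_pairs[0])    # ('Lisa Rose', 'Gene Seymour')
```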
I tried to improve the Pearson correlation similarity function suggested in the book by also returning a p-value, but only when the function's p_value parameter is True. The function is defined this way:
def sim_pearson(prefs, p1, p2, p_value=False):
    """Returns the pearson correlation coefficient and the p-value (optional)
    of the ratings of the movies that both p1 and p2 have rated"""
    # Creates a list with the movies that both p1 and p2 have rated
    movies = [movie for movie in prefs[p1] if movie in prefs[p2]]
    # List of the scores that both p1 and p2 have given to the movies in common
    scores_p1 = [prefs[p1][movie] for movie in movies]
    scores_p2 = [prefs[p2][movie] for movie in movies]
    corr, p_value = scipy.stats.pearsonr(scores_p1, scores_p2)
    if p_value:
        return (corr, p_value)
    else:
        return corr
My problem is that the function doesn't work as expected: it doesn't always return the tuple (correlation coefficient, p-value) when p_value is True, and it produces the same results whether p_value is True or False. Why is this happening and how can I fix it?
Here is the result of applying the function to each possible pair of people, which shows the problem. The result is the same with p_value=True as with p_value=False, so I'll only paste the former.
pearson_results = [(pair[0][:5],
                    pair[1][:5],
                    sim_pearson(critics, pair[0], pair[1], p_value=True))
                   for pair in unique_pairs]
pearson_results
[('Micha', 'Mick ', (-0.2581988897471611, 0.74180111025283857)),
('Micha', 'Lisa ', (0.40451991747794525, 0.59548008252205464)),
('Micha', 'Toby', -1.0),
('Micha', 'Jack ', (0.13483997249264842, 0.8651600275073511)),
('Micha', 'Gene ', (0.20459830184114206, 0.79540169815885797)),
('Micha', 'Claud', 1.0),
('Mick ', 'Lisa ', (0.59408852578600457, 0.21370636293028805)),
('Mick ', 'Toby', (0.92447345164190498, 0.24901011701138964)),
('Mick ', 'Jack ', (0.21128856368212914, 0.73299431171284912)),
('Mick ', 'Gene ', (0.41176470588235292, 0.41726032973743138)),
('Mick ', 'Claud', (0.56694670951384085, 0.3189317919127756)),
('Lisa ', 'Toby', (0.99124070716193036, 0.084323216321943714)),
('Lisa ', 'Jack ', (0.74701788083399601, 0.14681146067336839)),
('Lisa ', 'Gene ', (0.39605901719066977, 0.43697492654267506)),
('Lisa ', 'Claud', (0.56694670951384085, 0.3189317919127756)),
('Toby', 'Jack ', (0.66284898035987017, 0.53869426797895403)),
('Toby', 'Gene ', (0.38124642583151169, 0.75098988298861025)),
('Toby', 'Claud', (0.89340514744156441, 0.29661883133160016)),
('Jack ', 'Gene ', (0.96379568187563314, 0.0082243534847899202)),
('Jack ', 'Claud', (0.028571428571428571, 0.9714285714285712)),
('Gene ', 'Claud', (0.31497039417435602, 0.60570041941160946))]
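The line corr, p_value = scipy.stats.pearsonr(scores_p1, scores_p2) rebinds the name p_value, so the flag you passed in is overwritten by the computed p-value. The following if p_value: then tests that number rather than your flag. It is truthy for any nonzero p-value and falsy only when the p-value is exactly 0.0, which is what happened for the two pairs whose correlation is exactly 1.0 or -1.0. That is also why p_value=True and p_value=False give identical results. Store the computed p-value under a different name by changing the bottom part of your function to:
    corr, p_value2 = scipy.stats.pearsonr(scores_p1, scores_p2)
    if p_value:
        return (corr, p_value2)
    else:
        return corr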
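To see why the rename matters, here is a minimal sketch that runs without SciPy. fake_pearsonr is a hypothetical stand-in for scipy.stats.pearsonr that returns fixed values chosen for illustration. The names buggy and fixed are also made up for this example.

```python
def fake_pearsonr(x, y):
    """Hypothetical stand-in for scipy.stats.pearsonr: returns (corr, p)."""
    # Illustrative values only: identical inputs give a p-value of 0.0.
    return (1.0, 0.0) if x == y else (0.5, 0.3)


def buggy(x, y, p_value=False):
    # The unpacking rebinds the parameter p_value to the computed p-value.
    corr, p_value = fake_pearsonr(x, y)
    if p_value:  # tests the float, not the caller's flag
        return (corr, p_value)
    return corr


def fixed(x, y, p_value=False):
    # A separate name keeps the caller's flag intact.
    corr, p = fake_pearsonr(x, y)
    return (corr, p) if p_value else corr


print(buggy([1, 2], [1, 2], p_value=True))   # 1.0
print(buggy([1, 2], [2, 1], p_value=False))  # (0.5, 0.3)
print(fixed([1, 2], [1, 2], p_value=True))   # (1.0, 0.0)
print(fixed([1, 2], [2, 1], p_value=False))  # 0.5
```

With the buggy version, the return type depends only on whether the computed p-value is zero. With the fixed version, it depends only on the flag.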