I'm trying to set up a model for measuring implicit feedback, given a nested dictionary that maps each unique user to the number of plays per artist they listened to. I've tried a few values, but the changes appear only cosmetic and not measurable.
Sample data:
users artist plays
0 00001411dc427966b17297bf4d69e7e193135d89 korn 12763
1 00001411dc427966b17297bf4d69e7e193135d89 sting 8192
2 00001411dc427966b17297bf4d69e7e193135d89 nirvana 6413
Source code:
user_artist_dict = user_artist_plays.groupby('users').apply(lambda user_artist_plays: dict(zip(user_artist_plays.artist, user_artist_plays.plays))).to_dict()
I want a matrix that holds 0 plays if the user did not listen to an artist, and the number of plays if they did.
My initial intention was to use DictVectorizer from sklearn, but it's giving me trouble with the artist strings.
Sparse version.
The current version of pandas has limited support for sparse data: you can call pd.get_dummies(sparse=True) and it will return a SparseDataFrame, but most manipulations transform the result back into a regular DataFrame.
Sparse versions of unstack and pivot_table are still on the roadmap.
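To make the limits of pandas' sparse support concrete, here is a small sketch. It assumes pandas 1.0 or later, where SparseDataFrame was removed and get_dummies(sparse=True) returns an ordinary DataFrame whose columns carry a sparse dtype. The toy Series is made up for illustration.

```python
import pandas as pd

# Hypothetical toy column of artist labels (not the question's data).
artists = pd.Series(['a1', 'a3', 'a2', 'a3'], name='artist')

# One-hot encode with sparse columns; dtype=int keeps the fill value at 0.
dummies = pd.get_dummies(artists, sparse=True, dtype=int)
print(dummies.dtypes)

# The .sparse accessor hands the frame to SciPy without densifying it.
coo = dummies.sparse.to_coo()
print(coo.shape, coo.nnz)
```

This only covers the one-hot encoding step; it does not provide a sparse pivot.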
So we should use other libraries to solve it. For example, sklearn has LabelBinarizer, which offers the same functionality as pd.get_dummies but can return a scipy.sparse matrix. Then, with the help of a small algebra trick, we can achieve the goal.
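The algebra trick can be illustrated on a tiny dense example. If U is the one-hot user matrix and A the one-hot artist matrix (one row per record), then U.T @ diag(plays) @ A adds each record's plays into the cell for its (user, artist) pair. The sketch below uses plain NumPy and made-up numbers, purely to show why the product works.

```python
import numpy as np

# Hypothetical toy records: (user index, artist index, plays).
users = np.array([0, 1, 1])
artists = np.array([0, 0, 1])
plays = np.array([5, 7, 9])

n_users, n_artists = 2, 2
U = np.eye(n_users, dtype=int)[users]      # one-hot users, shape (3, 2)
A = np.eye(n_artists, dtype=int)[artists]  # one-hot artists, shape (3, 2)

# M[i, j] = sum over records k of U[k, i] * plays[k] * A[k, j]
M = U.T @ np.diag(plays) @ A
print(M)
```

With sparse one-hot matrices the same product never materializes the dense intermediate.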
Test data sample will help check correctness of calculations:
import pandas as pd

df = pd.DataFrame([
    ['a1', 11, 'u1'], ['a3', 23, 'u2'], ['a2', 22, 'u2'],
    ['a3', 33, 'u3'], ['a1', 31, 'u3'], ['a2', 32, 'u3'],
    ['a5', 45, 'u4'], ['a4', 44, 'u4'], ['a3', 43, 'u4'],
    ['a2', 42, 'u4']], columns=['artist', 'plays', 'users'])
# Dense reference result: one row per user, one column per artist.
print(df.pivot_table(values='plays', index='users',
                     columns='artist', aggfunc='sum', fill_value=0))
Output:
artist a1 a2 a3 a4 a5
users
u1 11 0 0 0 0
u2 0 22 23 0 0
u3 31 32 33 0 0
u4 0 42 43 44 45
The same result, but sparse:
from sklearn.preprocessing import LabelBinarizer
from scipy.sparse import diags
lb_artist = LabelBinarizer(sparse_output=True)
lb_user = LabelBinarizer(sparse_output=True)
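# One-hot encode users and artists (one row per record), then compute
# users.T * diag(plays) * artists to sum plays into a (user, artist) matrix.
X = lb_user.fit_transform(df.users).T * (diags(df.plays) * lb_artist.fit_transform(df.artist))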
print(type(X),'shape:',X.shape,'values:',X.nnz,'data:')
print(X)
Output. For the test sample, the value at position (user, artist) should equal 10*(user+1) + (artist+1):
<class 'scipy.sparse.csc.csc_matrix'> shape: (4, 5) values: 10 data:
(2, 0) 31.0
(0, 0) 11.0
(3, 1) 42.0
(2, 1) 32.0
(1, 1) 22.0
(3, 2) 43.0
(2, 2) 33.0
(1, 2) 23.0
(3, 3) 44.0
(3, 4) 45.0
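One caveat with the product above: LabelBinarizer emits a single column when there are only two classes, so the result would have the wrong shape for a dataset with exactly two users or two artists. As an alternative sketch (a different technique, not the method above), the matrix can be built directly with scipy.sparse.coo_matrix from integer codes. The names user_codes, artist_codes and X_alt are introduced here only for illustration.

```python
import pandas as pd
from scipy.sparse import coo_matrix

df = pd.DataFrame([
    ['a1', 11, 'u1'], ['a3', 23, 'u2'], ['a2', 22, 'u2'],
    ['a3', 33, 'u3'], ['a1', 31, 'u3'], ['a2', 32, 'u3'],
    ['a5', 45, 'u4'], ['a4', 44, 'u4'], ['a3', 43, 'u4'],
    ['a2', 42, 'u4']], columns=['artist', 'plays', 'users'])

# Map each label to an integer code; sort=True keeps labels in sorted order.
user_codes, user_labels = pd.factorize(df['users'], sort=True)
artist_codes, artist_labels = pd.factorize(df['artist'], sort=True)

# COO construction; duplicate (user, artist) pairs are summed on conversion to CSR.
X_alt = coo_matrix(
    (df['plays'].to_numpy(), (user_codes, artist_codes)),
    shape=(len(user_labels), len(artist_labels)),
).tocsr()
print(X_alt.toarray())
```

Here user_labels and artist_labels play the role that classes_ plays on the binarizers.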
lb_user and lb_artist store the string values of the keys, so they can be used to recover the original frame.
print(
pd.DataFrame(X.todense(),index=lb_user.classes_,columns=lb_artist.classes_)
)
Output:
a1 a2 a3 a4 a5
u1 11.0 0.0 0.0 0.0 0.0
u2 0.0 22.0 23.0 0.0 0.0
u3 31.0 32.0 33.0 0.0 0.0
u4 0.0 42.0 43.0 44.0 45.0
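Calling todense() is fine for a small test, but it defeats the purpose on real data. A hedged sketch of recovering the long (users, artist, plays) layout without densifying is to read the COO coordinates and look the labels up. The small matrix and label arrays below are made up and stand in for X, lb_user.classes_ and lb_artist.classes_.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical stand-ins for lb_user.classes_, lb_artist.classes_ and X.
user_labels = np.array(['u1', 'u2'])
artist_labels = np.array(['a1', 'a2', 'a3'])
X_small = csr_matrix(np.array([[11, 0, 0],
                               [0, 22, 23]]))

# COO format exposes the row index, column index and value of each stored entry.
coo = X_small.tocoo()
long_df = pd.DataFrame({
    'users': user_labels[coo.row],
    'artist': artist_labels[coo.col],
    'plays': coo.data,
})
print(long_df)
```

Only the stored nonzero entries are touched, so memory use stays proportional to the number of (user, artist) pairs with plays.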