python matrix scikit-learn sklearn-pandas matrix-factorization

Python dict to matrix using an implicit feedback approach

I'm trying to set up the best model for measuring implicit feedback, given my nested dictionary of unique users and the number of plays per artist each user listened to. I've tried a few values, but the changes appear only cosmetic and have no measurable effect.

Sample data:

    users                                       artist   plays
0   00001411dc427966b17297bf4d69e7e193135d89    korn     12763
1   00001411dc427966b17297bf4d69e7e193135d89    sting    8192
2   00001411dc427966b17297bf4d69e7e193135d89    nirvana  6413

Source code:

user_artist_dict = user_artist_plays.groupby('users').apply(lambda user_artist_plays: dict(zip(user_artist_plays.artist, user_artist_plays.plays))).to_dict()

I want a matrix that holds 0 if the user did not listen to the artist, and the number of plays if they did.

My initial intention was to use DictVectorizer from sklearn, but it's giving me trouble with the artist strings.

Solution

Sparse version. At the time of writing, pandas has limited support for sparse data: you can call pd.get_dummies(sparse=True) and it will return a SparseDataFrame, but most manipulations convert the result back to a dense DataFrame. Sparse versions of unstack and pivot_table are still on the roadmap. So we need other libraries to solve this. For example, sklearn has LabelBinarizer, which does the same job as pd.get_dummies but can return a scipy.sparse matrix. With a little matrix algebra on top of that, we can reach the goal.

A small test data sample helps check that the calculations are correct:

import pandas as pd

df = pd.DataFrame([
             ['a1', 11, 'u1'], ['a3', 23, 'u2'], ['a2', 22, 'u2'],
             ['a3', 33, 'u3'], ['a1', 31, 'u3'], ['a2', 32, 'u3'],
             ['a5', 45, 'u4'], ['a4', 44, 'u4'], ['a3', 43, 'u4'],
             ['a2', 42, 'u4']], columns =['artist', 'plays', 'users'])
print(df.pivot_table(values='plays',index='users',
                     columns='artist',aggfunc='sum',fill_value=0))

Output:

artist  a1  a2  a3  a4  a5
users                     
u1      11   0   0   0   0
u2       0  22  23   0   0
u3      31  32  33   0   0
u4       0  42  43  44  45

The same result, but sparse:

from sklearn.preprocessing import LabelBinarizer
from scipy.sparse import diags
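# sparse one-hot encoders: one indicator column per distinct label
lb_artist = LabelBinarizer(sparse_output=True)
lb_user = LabelBinarizer(sparse_output=True)
# (users x rows) * diag(plays) * (rows x artists) -> users x artists play counts
# .values passes a plain array, so the call does not depend on the frame's index labels
X = lb_user.fit_transform(df.users).T*(diags(df.plays.values)*lb_artist.fit_transform(df.artist))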
print(type(X),'shape:',X.shape,'values:',X.nnz,'data:')
print(X)

Output. For the test sample, the value at position (user, artist), with zero-based indices, should be 10*(user+1) + (artist+1):

<class 'scipy.sparse.csc.csc_matrix'> shape: (4, 5) values: 10 data:
  (2, 0)    31.0
  (0, 0)    11.0
  (3, 1)    42.0
  (2, 1)    32.0
  (1, 1)    22.0
  (3, 2)    43.0
  (2, 2)    33.0
  (1, 2)    23.0
  (3, 3)    44.0
  (3, 4)    45.0
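
To make the algebra behind the one-line product easier to follow, here is a minimal sketch that splits it into named steps. The names toy, U, A, P and X_toy are my own, chosen for illustration, and the frame is a smaller stand-in for the test data, not the original sample.

```python
import pandas as pd
from scipy.sparse import diags
from sklearn.preprocessing import LabelBinarizer

# Toy frame (hypothetical data): one row per (user, artist, plays) record
toy = pd.DataFrame({
    'users':  ['u1', 'u2', 'u2', 'u3'],
    'artist': ['a1', 'a3', 'a2', 'a1'],
    'plays':  [11, 23, 22, 31],
})

user_enc = LabelBinarizer(sparse_output=True)
artist_enc = LabelBinarizer(sparse_output=True)

# U[i, u] == 1 when record i belongs to user u      -> shape (n_records, n_users)
U = user_enc.fit_transform(toy['users'])
# A[i, a] == 1 when record i belongs to artist a    -> shape (n_records, n_artists)
A = artist_enc.fit_transform(toy['artist'])
# P holds the play count of record i on its diagonal -> shape (n_records, n_records)
P = diags(toy['plays'].to_numpy(dtype=float))

# X_toy[u, a] = sum over records i of U[i, u] * plays[i] * A[i, a]
X_toy = U.T @ P @ A

print(U.shape, P.shape, A.shape, X_toy.shape)
print(X_toy.toarray())
```

Because the product sums over records, repeated (user, artist) rows are added together, which matches aggfunc='sum' in the pivot_table version. One caveat: with exactly two distinct labels, LabelBinarizer returns a single column instead of two, so the shapes no longer line up; the sketch assumes at least three distinct users and three distinct artists.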

lb_user and lb_artist store the original string labels in classes_, so they can be used to recover the original frame:

print(
    pd.DataFrame(X.todense(),index=lb_user.classes_,columns=lb_artist.classes_)
    )

Output:

         a1    a2    a3    a4    a5
    u1  11.0   0.0   0.0   0.0   0.0
    u2   0.0  22.0  23.0   0.0   0.0
    u3  31.0  32.0  33.0   0.0   0.0
    u4   0.0  42.0  43.0  44.0  45.0
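
Since the question starts from a nested dictionary and mentions DictVectorizer, here is a hedged sketch of that route as an alternative to the LabelBinarizer product. It assumes each inner dict maps artist name (string key) to play count (numeric value); the small frame and the names users, artists and X_dv are illustrative, not taken from the original data.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Small stand-in frame (hypothetical data) with the same columns as the test sample
df_small = pd.DataFrame([
    ['a1', 11, 'u1'], ['a3', 23, 'u2'], ['a2', 22, 'u2'],
    ['a3', 33, 'u3'], ['a1', 31, 'u3'],
], columns=['artist', 'plays', 'users'])

# Nested dict in the shape described in the question: {user: {artist: plays}}
# Play counts are cast to plain int so every value is an ordinary Python number.
user_artist_dict = {
    user: {artist: int(plays) for artist, plays in zip(group['artist'], group['plays'])}
    for user, group in df_small.groupby('users')
}

# Fix the row order explicitly so rows can be mapped back to users later
users = list(user_artist_dict)

# Keys become columns, numeric values become the entries, missing keys stay 0
dv = DictVectorizer(sparse=True)
X_dv = dv.fit_transform([user_artist_dict[u] for u in users])

# Column labels (artist names) in the order DictVectorizer chose
artists = dv.feature_names_

print(pd.DataFrame(X_dv.toarray(), index=users, columns=artists))
```

The design point is that DictVectorizer treats numeric values as the feature value itself, while string values are one-hot encoded as "key=value" columns, so the artist names need to be the keys rather than the values.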