I'm quite new to these topics. I'm currently developing a latent factor matrix factorization that will serve as training data for a neural network.
I have a CSV table like this:
user_id song_id playcount
frank SOBYHAJ12A6701BF1D 23
john SODACBL12A8C13C273 1
john SODXRTY12AB0180F3B 3
mary SOFRQTD12A81C233C0 1
You could think of this table as a description of a matrix. I want to build a matrix as:
rows=song_id, columns=user_id, value=playcount
I've loaded the data into a pandas dataframe:
triplets_training_set = pd.read_csv(filepath)
Now I want to build a sparse matrix with that data.
Another question:
Do I need to vectorize the values, i.e. translate 'b80344d063b5ccb3212f76538f3d9e43d87dca9e' to an integer user_id (and the same for song_id)?
I've read questions like this, but I don't know how to approach the last question.
The only solution I came up with was to first make 2 dicts like:
{ frank: 1, john: 2, mary: 3, .. }
{ SOBYHAJ12A6701BF1D: 1, SODACBL12A8C13C273: 2, .. }
and then iterate over the dataframe triplets_training_set row by row, constructing the matrix. But this is a naive solution. There must be a better one.
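To make the naive approach concrete, here is a minimal sketch of it using only the standard library. It assumes the file is comma-separated with the three column names shown above, and it numbers IDs from 0 rather than 1. The names raw, user_index, song_index and entries are made up for illustration.

```python
import csv
import io

# Sample rows from the table above (assumed comma-separated);
# a real run would open the CSV file instead of this string.
raw = """user_id,song_id,playcount
frank,SOBYHAJ12A6701BF1D,23
john,SODACBL12A8C13C273,1
john,SODXRTY12AB0180F3B,3
mary,SOFRQTD12A81C233C0,1
"""

user_index = {}  # user_id string -> column number
song_index = {}  # song_id string -> row number
entries = {}     # (row, column) -> playcount, i.e. a dict-of-keys sparse matrix

for record in csv.DictReader(io.StringIO(raw)):
    # setdefault hands out the next free integer the first time an ID is seen.
    row = song_index.setdefault(record["song_id"], len(song_index))
    col = user_index.setdefault(record["user_id"], len(user_index))
    entries[(row, col)] = int(record["playcount"])
```

This works, but it loops over every row in Python, which is what I would like to avoid.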
Thanks in advance!
Is this what you want?
df.pivot(index='user_id', columns='song_id', values='playcount')
Out[648]:
song_id  SOBYHAJ12A6701BF1D  SODACBL12A8C13C273  SODXRTY12AB0180F3B  \
user_id
frank                  23.0                 NaN                 NaN
john                    NaN                 1.0                 3.0
mary                    NaN                 NaN                 NaN

song_id  SOFRQTD12A81C233C0
user_id
frank                   NaN
john                    NaN
mary                    1.0
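
Note that pivot returns a dense frame padded with NaN, which may not fit in memory for a large dataset. If you need an actual sparse matrix, one possible approach (a sketch, assuming scipy is installed and the columns are named as in the question) is to convert both ID columns to pandas categoricals. Their integer codes can serve as row and column indices, so the string IDs do get mapped to integers, but without hand-built dicts or a Python loop. The names rows, cols, matrix, song_ids and user_ids are made up for illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Sample data from the question; in practice this comes from pd.read_csv(filepath).
df = pd.DataFrame({
    "user_id": ["frank", "john", "john", "mary"],
    "song_id": ["SOBYHAJ12A6701BF1D", "SODACBL12A8C13C273",
                "SODXRTY12AB0180F3B", "SOFRQTD12A81C233C0"],
    "playcount": [23, 1, 3, 1],
})

# A categorical assigns each distinct string an integer code (0, 1, 2, ...).
rows = df["song_id"].astype("category")
cols = df["user_id"].astype("category")

# rows = song_id, columns = user_id, value = playcount
matrix = csr_matrix(
    (df["playcount"].to_numpy(),
     (rows.cat.codes.to_numpy(), cols.cat.codes.to_numpy())),
    shape=(len(rows.cat.categories), len(cols.cat.categories)),
)

# Lookup tables to translate a row/column number back to the original ID.
song_ids = rows.cat.categories
user_ids = cols.cat.categories
```

Here matrix[i, j] is the playcount of song_ids[i] for user_ids[j]. If the same (song, user) pair appears more than once in the data, the conversion to CSR sums the duplicate entries.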