Tags: python, pandas, matrix, scikit-learn, sparse-matrix

Pandas: Dataframe to Matrix


I'm quite new to these topics. I'm currently developing a latent factor matrix factorization whose output will be training data for a neural network.

I have a csv table like this:

user_id song_id playcount
frank   SOBYHAJ12A6701BF1D  23
john    SODACBL12A8C13C273  1
john    SODXRTY12AB0180F3B  3
mary    SOFRQTD12A81C233C0  1

You can think of this table as a description of a matrix. I want to build a matrix where:

rows=song_id, columns=user_id, value=playcount

I've loaded the data into a pandas dataframe:

triplets_training_set = pd.read_csv(filepath)

Now I want to build a sparse matrix with that data.

Another question:

Do I need to encode the values, i.e. translate a string such as 'b80344d063b5ccb3212f76538f3d9e43d87dca9e' to an integer user_id (and the same for song_id)?

I've read questions like this, but I don't know how to approach that last question.


The only solution I came up with was to first make two dicts like:

{ frank: 1, john: 2, mary: 3, ... }
{ SOBYHAJ12A6701BF1D: 1, SODACBL12A8C13C273: 2, ... }

and then iterate over the dataframe triplets_training_set row by row, constructing the matrix. But this is a naive solution. There must be a better one.

Thanks in advance!


Solution

  • Is this what you want?

    df.pivot(index='user_id', columns='song_id', values='playcount')
    Out[648]: 
    song_id  SOBYHAJ12A6701BF1D  SODACBL12A8C13C273  SODXRTY12AB0180F3B  \
    user_id                                                               
    frank                  23.0                 NaN                 NaN   
    john                    NaN                 1.0                 3.0   
    mary                    NaN                 NaN                 NaN   
    song_id  SOFRQTD12A81C233C0  
    user_id                      
    frank                   NaN  
    john                    NaN  
    mary                    1.0
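
The pivot above is dense and has users as rows. For the sparse, song-by-user layout the question asks for, one possible sketch is below. It assumes SciPy is available and uses pandas' categorical dtype to turn the string ids into integer positions, so no hand-built dicts or row-by-row loop are needed. The names `df`, `songs`, `users`, `song_index` and `user_index` are illustrative, not from the original post, and the toy frame just repeats the question's four rows.

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Toy data copied from the table in the question (stand-in for the real CSV).
df = pd.DataFrame({
    "user_id": ["frank", "john", "john", "mary"],
    "song_id": ["SOBYHAJ12A6701BF1D", "SODACBL12A8C13C273",
                "SODXRTY12AB0180F3B", "SOFRQTD12A81C233C0"],
    "playcount": [23, 1, 3, 1],
})

# A categorical column stores each distinct string once and gives it
# an integer code; those codes serve as matrix positions.
songs = df["song_id"].astype("category")
users = df["user_id"].astype("category")

# rows = song_id, columns = user_id, value = playcount
matrix = coo_matrix(
    (df["playcount"].to_numpy(),
     (songs.cat.codes.to_numpy(), users.cat.codes.to_numpy())),
    shape=(len(songs.cat.categories), len(users.cat.categories)),
).tocsr()

# Lookup tables to translate a row/column position back to the original id.
song_index = dict(enumerate(songs.cat.categories))
user_index = dict(enumerate(users.cat.categories))
```

With this approach the integer encoding falls out of the categorical codes, and `song_index` / `user_index` let you map factorization results back to the original ids. Only the non-zero play counts are stored, which matters once the number of users and songs gets large.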