python dataframe matrix indexing sparse-matrix

Transforming dataframe to sparse matrix and reset index


I have a data set of ratings given by users to products. There are only 5,000 products and 10,000 users, but the IDs are not consecutive numbers. I would like to transform my dataframe into a sparse COO matrix, coo_matrix((data, (row, col)), shape), whose rows and columns count the actual users and products rather than being indexed by the raw IDs. Is there any way to do that? Below is an illustration:

Data frame:

User ID  Product ID  Rating
1        14          0.1
1        15          0.2
2        14          0.3
2        16          0.3
5        19          0.4

The expected result is the following matrix (stored in sparse COO form):

ProductID   14   15   16   19
UserID
1          0.1  0.2    0    0
2          0.3    0  0.3    0
5            0    0    0  0.4

Normally, coo_matrix would instead build a much larger matrix indexed by the raw IDs: columns (0, 1, ..., 19) for product ID and rows (0, 1, ..., 5) for user ID.

This is for my thesis.


Solution

  • Hi, hope this helps, and good luck with your thesis:

    import pandas as pd
    from scipy.sparse import coo_matrix

    dataframe = pd.DataFrame(data={'User ID': [1, 1, 2, 2, 5],
                                   'Product ID': [14, 15, 14, 16, 19],
                                   'Rating': [0.1, 0.2, 0.3, 0.3, 0.4]})

    row = dataframe['User ID']
    col = dataframe['Product ID']
    data = dataframe['Rating']

    # Dense array indexed by the raw IDs: rows 0..5 (users), columns 0..19 (products)
    coo = coo_matrix((data, (row, col))).toarray()
    new_dataframe = pd.DataFrame(coo)

    # Drop non-existing Product IDs (all-zero columns); optional, delete if not intended
    new_dataframe = new_dataframe.loc[:, (new_dataframe != 0).any(axis=0)]

    # Drop non-existing User IDs (all-zero rows); optional, delete if not intended
    new_dataframe = new_dataframe.loc[(new_dataframe != 0).any(axis=1)]

    print(new_dataframe)
    

    Output:

        14   15   16   19
    1  0.1  0.2  0.0  0.0
    2  0.3  0.0  0.3  0.0
    5  0.0  0.0  0.0  0.4
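
The approach above goes through a dense array sized by the largest raw ID, which may get expensive with 10,000 users and sparse ID ranges. As an alternative sketch (not part of the original answer), the IDs can be remapped to consecutive positions first, so the matrix stays sparse with shape (number of users, number of products). This assumes pandas' `factorize` with `sort=True`; the names `user_ids`, `product_ids`, and `sparse` are illustrative only.

```python
import pandas as pd
from scipy.sparse import coo_matrix

dataframe = pd.DataFrame({
    "User ID": [1, 1, 2, 2, 5],
    "Product ID": [14, 15, 14, 16, 19],
    "Rating": [0.1, 0.2, 0.3, 0.3, 0.4],
})

# factorize maps each distinct ID to a position 0..n-1;
# sort=True orders the positions by ascending ID.
row, user_ids = pd.factorize(dataframe["User ID"], sort=True)
col, product_ids = pd.factorize(dataframe["Product ID"], sort=True)

# Sparse matrix sized by the number of distinct users and products,
# not by the largest raw ID.
sparse = coo_matrix(
    (dataframe["Rating"].to_numpy(), (row, col)),
    shape=(len(user_ids), len(product_ids)),
)

print(sparse.shape)
print(list(user_ids))     # raw user ID for each matrix row
print(list(product_ids))  # raw product ID for each matrix column
```

Here `user_ids[i]` and `product_ids[j]` give back the original IDs for row `i` and column `j`, so the mapping can be reversed after any computation on the sparse matrix.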