python pivot-table sparse-matrix

How to make a sparse matrix in Python from a data frame with string column names


I need to convert a data frame to a sparse matrix. The data frame looks similar to this (the actual data is much bigger: approximately 500,000 rows and 1,000 columns):

Dataframe

I need to convert it into a matrix whose rows are the 'id' values and whose columns are the 'names' values, storing only the finite values. NaNs should not be stored (to reduce memory usage). When I tried pd.pivot_table, it took a long time to build the matrix for my large data set.

In R, there is a function called 'dMcast' for this purpose. I looked but could not find an equivalent in Python. I'm new to Python.


Solution

  • First I would convert the categorical names column to integer indices. Maybe pandas has this functionality already?

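    names = list('PQRSPSS')
    # Sort the unique names so the name -> column index mapping is reproducible
    name_ids_map = {n: i for i, n in enumerate(sorted(set(names)))}
    # Column index for every entry in names
    name_ids = [name_ids_map[n] for n in names]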
    

    Then I would use scipy.sparse.coo_matrix and maybe convert the result to another sparse format.

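    import scipy.sparse

    ids = [1, 1, 1, 1, 2, 2, 3]
    rating = [2, 4, 1, 4, 2, 2, 1]
    # Build the matrix from (values, (row indices, column indices))
    sp = scipy.sparse.coo_matrix((rating, (ids, name_ids)))
    print(sp)
    # COO is convenient for construction; convert for slicing or arithmetic
    sp_csc = sp.tocsc()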
    

    I am not aware of a sparse matrix library that can index a dimension with categorical data like 'R', 'S', etc.
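
On the question of whether pandas already has this: pandas.factorize returns integer codes together with the unique labels, so it can replace the hand-built dictionary and keep a label-to-index mapping for both dimensions. A minimal sketch, assuming a long-format frame with columns named id, name and rating (those column names and the sample values are assumptions taken from the lists above, not the asker's real data):

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Hypothetical long-format frame; the column names are assumptions.
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 3],
    "name": list("PQRSPSS"),
    "rating": [2, 4, 1, 4, 2, 2, 1],
})

# factorize gives integer codes plus the unique labels,
# in order of first appearance.
row_codes, row_labels = pd.factorize(df["id"])
col_codes, col_labels = pd.factorize(df["name"])

# Build in COO form, then convert to CSR for row-oriented access.
sp = coo_matrix(
    (df["rating"].to_numpy(), (row_codes, col_codes)),
    shape=(len(row_labels), len(col_labels)),
).tocsr()

# Look up a cell by its original labels via the saved indexes.
value = sp[row_labels.get_loc(2), col_labels.get_loc("S")]
```

Because the codes start at 0, this also avoids the empty row 0 that appears when the raw ids are used directly as row indices. If the same (id, name) pair occurs more than once, the COO-to-CSR conversion sums the duplicates, so aggregate beforehand if a different rule is wanted.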