pandas numpy python-polars

Convert pandas dataframe to sparse matrix


I have a pandas dataframe e.g.

import pandas as pd

data = {'col_1': ['a', 'b'], 'col_2': ['b', 'c']}
df = pd.DataFrame(data)

I want to convert this to a sparse representation of the data in numpy e.g.

[[[1, 0], [0, 0]], [[0, 1], [1, 0]], [[0, 0], [0, 1]]]

where each 2x2 matrix represents the occurrences of 'a', 'b' and 'c', respectively, in my pandas dataframe.

I can achieve the desired outcome through some messy operations:

boolean_matrix = pd.get_dummies(df, prefix='', prefix_sep='').groupby(level=0, axis=1).sum()

boolean_matrix = boolean_matrix.values.tolist()
boolean_matrix = [[[int(i == j) for j in range(len(boolean_matrix[0]))] for i in row] for row in boolean_matrix]

print(boolean_matrix)

But I can't believe this is the standard way to do what is probably a pretty common operation. Are there any built-in methods (pandas, polars, numpy, tensorflow) that will do this?


Solution

  • Let's use broadcasting and np.unique (with numpy imported as np):

    out = (df.to_numpy() == np.unique(df)[:,None,None]).astype(int)
    

    Or, for a specific order:

    out = (df.to_numpy() == np.array(['a', 'b', 'c'])[:,None,None]).astype(int)
    

    Output:

    array([[[1, 0],
            [0, 0]],
    
           [[0, 1],
            [1, 0]],
    
           [[0, 0],
            [0, 1]]])
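
For reference, here is a self-contained sketch of the broadcasting approach above, with the intermediate shapes spelled out. The names `values` and `categories` are introduced here only for illustration; the example data is the frame from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': ['a', 'b'], 'col_2': ['b', 'c']})

values = df.to_numpy()           # the raw 2x2 array of labels
categories = np.unique(values)   # sorted unique labels, shape (3,)

# Indexing with [:, None, None] reshapes the labels to (3, 1, 1), so the
# comparison broadcasts against the (2, 2) data and produces one 2x2
# indicator matrix per label.
out = (values == categories[:, None, None]).astype(int)

print(categories.tolist())
print(out.shape)
print(out.tolist())
```

Note that the result is a dense array of 0/1 indicators rather than a sparse matrix type, and that `np.unique` returns the labels in sorted order, so `out[k]` corresponds to `categories[k]`.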