Search code examples
pythonpandasgroup-bysparse-dataframe

Grouping large sparse pandas dataframe with groupby.sum() is very slow


I have pandas dataframe of size (607875, 12294). The data is sparse and looks like:

     ID BB CC DD ...
0   abc 0  0  1  ...
1   bcd 0  0  0  ...
2   abc 0  0  1  ...
...

I converted it to the sparse form by calling

dataframe = dataframe.to_sparse()

Later, I groupped it by ID and sum the row values by

dataframe = dataframe.groupby("ID").sum()

For smaller dataframes it works perfectly well, but for this size, it worked for an hour and did not finish the work. Is there a way to speed it up or get around it? Is there any other sparse methods I can use because the to_sparse method is deprecated.

The size of output dataframe would be (2000, 12294) and look like (if there is no other 1 in abc column):

     ID BB CC DD ...
0   abc 0  0  2  ...
1   bcd 0  0  0  ...
...

I have 32 GB RAM on my PC, so it should be enough.


Solution

  • Inspired by https://stackoverflow.com/a/50991732/8035867 here is a solution that relies on Sklearn to do a kind of sparse one-hot encoding of the group labels and then uses Scipy to do a dot product of two sparse row matrices.

    Edit: Used One-Hot Encoder instead to cope with the situation where there are only two classes in the group by.

    from sklearn.preprocessing import OneHotEncoder
    
    def sparse_groupby_sum(df, groupby):
        ohe = OneHotEncoder(sparse_output=True)
        # Get all other columns we are not grouping by
        other_columns = [col for col in df.columns if col != groupby]
        # Get a 607875 x nDistinctIDs matrix in sparse row format with exactly 
        # 1 nonzero entry per row
        onehot = ohe.fit_transform(df[groupby].values.reshape(-1, 1))
        # Transpose it. then convert from sparse column back to sparse row, as 
        # dot products of two sparse row matrices are faster than sparse col with
        # sparse row
        onehot = onehot.T.tocsr()
        # Dot the transposed matrix with the other columns of the df, converted to sparse row 
        # format, then convert the resulting matrix back into a sparse 
        # dataframe with the same column names
        out = pd.DataFrame.sparse.from_spmatrix(
            onehot.dot(df[other_columns].sparse.to_coo().tocsr()), 
            columns=other_columns)
        # Add in the groupby column to this resulting dataframe with the proper class labels
        out[groupby] = ohe.categories_[0]
        # This final groupby sum simply ensures the result is in the format you would expect 
        # for a regular pandas groupby and sum, but you can just return out if this is going to be 
        # a performance penalty. Note in that case that the groupby column may have changed index
        return out.groupby(groupby).sum()
    
    dataframe = sparse_groupby_sum(dataframe, "ID")
    

    Note that for performance purposes you can inline the definition of the onehot variable to the out = line, I've just separated it out here for didactic purposes.