Grouping large sparse pandas dataframe with groupby.sum() is very slow

I have pandas dataframe of size (607875, 12294). The data is sparse and looks like:

     ID BB CC DD ...
0   abc 0  0  1  ...
1   bcd 0  0  0  ...
2   abc 0  0  1  ...
...

I converted it to the sparse form by calling

dataframe = dataframe.to_sparse()

Later, I groupped it by ID and sum the row values by

dataframe = dataframe.groupby("ID").sum()

For smaller dataframes it works perfectly well, but for this size, it worked for an hour and did not finish the work. Is there a way to speed it up or get around it? Is there any other sparse methods I can use because the to_sparse method is deprecated.

The size of output dataframe would be (2000, 12294) and look like (if there is no other 1 in abc column):

     ID BB CC DD ...
0   abc 0  0  2  ...
1   bcd 0  0  0  ...
...

I have 32 GB RAM on my PC, so it should be enough.

Solution

Inspired by https://stackoverflow.com/a/50991732/8035867 here is a solution that relies on Sklearn to do a kind of sparse one-hot encoding of the group labels and then uses Scipy to do a dot product of two sparse row matrices.

Edit: Used One-Hot Encoder instead to cope with the situation where there are only two classes in the group by.

from sklearn.preprocessing import OneHotEncoder

def sparse_groupby_sum(df, groupby):
    ohe = OneHotEncoder(sparse_output=True)
    # Get all other columns we are not grouping by
    other_columns = [col for col in df.columns if col != groupby]
    # Get a 607875 x nDistinctIDs matrix in sparse row format with exactly 
    # 1 nonzero entry per row
    onehot = ohe.fit_transform(df[groupby].values.reshape(-1, 1))
    # Transpose it. then convert from sparse column back to sparse row, as 
    # dot products of two sparse row matrices are faster than sparse col with
    # sparse row
    onehot = onehot.T.tocsr()
    # Dot the transposed matrix with the other columns of the df, converted to sparse row 
    # format, then convert the resulting matrix back into a sparse 
    # dataframe with the same column names
    out = pd.DataFrame.sparse.from_spmatrix(
        onehot.dot(df[other_columns].sparse.to_coo().tocsr()), 
        columns=other_columns)
    # Add in the groupby column to this resulting dataframe with the proper class labels
    out[groupby] = ohe.categories_[0]
    # This final groupby sum simply ensures the result is in the format you would expect 
    # for a regular pandas groupby and sum, but you can just return out if this is going to be 
    # a performance penalty. Note in that case that the groupby column may have changed index
    return out.groupby(groupby).sum()

dataframe = sparse_groupby_sum(dataframe, "ID")

Note that for performance purposes you can inline the definition of the onehot variable to the out = line, I've just separated it out here for didactic purposes.