I have pandas dataframe of size (607875, 12294)
. The data is sparse and looks like:
ID BB CC DD ...
0 abc 0 0 1 ...
1 bcd 0 0 0 ...
2 abc 0 0 1 ...
...
I converted it to the sparse form by calling
dataframe = dataframe.to_sparse()
Later, I groupped it by ID
and sum
the row values by
dataframe = dataframe.groupby("ID").sum()
For smaller dataframes it works perfectly well, but for this size, it worked for an hour and did not finish the work.
Is there a way to speed it up or get around it? Is there any other sparse methods I can use because the to_sparse
method is deprecated.
The size of output dataframe would be (2000, 12294)
and look like (if there is no other 1 in abc
column):
ID BB CC DD ...
0 abc 0 0 2 ...
1 bcd 0 0 0 ...
...
I have 32 GB RAM on my PC, so it should be enough.
Inspired by https://stackoverflow.com/a/50991732/8035867 here is a solution that relies on Sklearn to do a kind of sparse one-hot encoding of the group labels and then uses Scipy to do a dot product of two sparse row matrices.
Edit: Used One-Hot Encoder instead to cope with the situation where there are only two classes in the group by.
from sklearn.preprocessing import OneHotEncoder
def sparse_groupby_sum(df, groupby):
ohe = OneHotEncoder(sparse_output=True)
# Get all other columns we are not grouping by
other_columns = [col for col in df.columns if col != groupby]
# Get a 607875 x nDistinctIDs matrix in sparse row format with exactly
# 1 nonzero entry per row
onehot = ohe.fit_transform(df[groupby].values.reshape(-1, 1))
# Transpose it. then convert from sparse column back to sparse row, as
# dot products of two sparse row matrices are faster than sparse col with
# sparse row
onehot = onehot.T.tocsr()
# Dot the transposed matrix with the other columns of the df, converted to sparse row
# format, then convert the resulting matrix back into a sparse
# dataframe with the same column names
out = pd.DataFrame.sparse.from_spmatrix(
onehot.dot(df[other_columns].sparse.to_coo().tocsr()),
columns=other_columns)
# Add in the groupby column to this resulting dataframe with the proper class labels
out[groupby] = ohe.categories_[0]
# This final groupby sum simply ensures the result is in the format you would expect
# for a regular pandas groupby and sum, but you can just return out if this is going to be
# a performance penalty. Note in that case that the groupby column may have changed index
return out.groupby(groupby).sum()
dataframe = sparse_groupby_sum(dataframe, "ID")
Note that for performance purposes you can inline the definition of the onehot variable to the out =
line, I've just separated it out here for didactic purposes.