Tags: pandas, fill, sparse-matrix

groupby on a sparse matrix in pandas: filling it first


I have a pandas DataFrame df with shape (1000000,3) as follows:

id      cat       team
1       'cat1'    A
1       'cat2'    A
2       'cat3'    B
3       'cat1'    A
4       'cat3'    B
4       'cat1'    B

Then I create dummy variables from the cat column to get the data ready for a machine-learning classification:

df2 = pandas.get_dummies(df,columns=['cat'], sparse=True)

But when I try to do:

df2.groupby(['id','team']).sum()

It gets stuck and the computation never ends. So instead of grouping right away, I try:

df2 = df2.fillna(0)

But it does not work and the DataFrame is still full of NaN values. Why doesn't fillna() fill my DataFrame as it should? In other words, how can the sparse matrix I get from get_dummies be filled with 0 instead of NaN?

I also tried:

df2 = pandas.get_dummies(df,columns=['cat'], sparse=True).to_sparse(fill_value=0)

This time, df2 is properly filled with 0, but when I try:

print df2.groupby(['id','team']).sum()

I get:

C:\Anaconda\lib\site-packages\pandas\core\groupby.pyc in loop(labels, shape)
   3545         for i in range(1, nlev):
   3546             stride //= shape[i]
-> 3547             out += labels[i] * stride
   3548 
   3549         if xnull: # exclude nulls
ValueError: operands could not be broadcast together with shapes (1205800,) (306994,) (1205800,) 

My solution was to do:

df2 = pandas.DataFrame(np.nan_to_num(df2.as_matrix()), columns=df2.columns)
df2.groupby(['id','team']).sum()

And it works, but it takes a lot of memory. Can someone help me find a better solution, or at least understand why I can't easily fill a sparse matrix with zeros? And why is it impossible to use groupby() followed by sum() on a sparse matrix?


Solution

  • I think your problem is due to the mixing of dtypes, but you can get around it like this. First, pass only the relevant column to get_dummies() rather than the whole dataframe:

    df2 = pd.get_dummies(df['cat']).to_sparse(fill_value=0)
    

    After that, you can add the other variables back, but everything needs to be numeric. A pandas sparse dataframe is just a wrapper around a sparse (and homogeneous-dtype) numpy array.

    df2['id'] = df['id']
    
       'cat1'  'cat2'  'cat3'  id
    0       1       0       0   1
    1       0       1       0   1
    2       0       0       1   2
    3       1       0       0   3
    4       0       0       1   4
    5       1       0       0   4
    
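    As a quick sanity check (my addition, not part of the original answer), you can confirm that no object-dtype column is left before grouping:

    print(df2.dtypes)   # every column should show a numeric (or numeric sparse) dtype, not 'object'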

    For non-numeric types, you could do the following:

    df2['team'] = df['team'].astype('category').cat.codes
    

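    If you later need to translate the integer codes back to the original team labels, one option (an assumption on my part, not something the answer above does) is to build the codes from a Categorical that you keep around:

    # Hypothetical helper: remember which integer code corresponds to which team label.
    team_cat = df['team'].astype('category')
    df2['team'] = team_cat.cat.codes                        # e.g. 'A' -> 0, 'B' -> 1
    code_to_label = dict(enumerate(team_cat.cat.categories))
    print(code_to_label)                                    # {0: 'A', 1: 'B'}
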
    This groupby seems to work OK:

    df2.groupby('id').sum()
    
        'cat1'  'cat2'  'cat3'
    id                        
    1        1       1       0
    2        0       0       1
    3        1       0       0
    4        1       0       1
    
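    Putting the pieces together, here is a minimal end-to-end sketch (my own consolidation, using the same older to_sparse() API as above and the sample data from your question; adjust for your pandas version):

    import pandas as pd

    # Sample data from the question.
    df = pd.DataFrame({'id':   [1, 1, 2, 3, 4, 4],
                       'cat':  ['cat1', 'cat2', 'cat3', 'cat1', 'cat3', 'cat1'],
                       'team': ['A', 'A', 'B', 'A', 'B', 'B']})

    # Dummify only 'cat' so the sparse frame stays entirely numeric.
    df2 = pd.get_dummies(df['cat']).to_sparse(fill_value=0)
    df2['id'] = df['id']
    df2['team'] = df['team'].astype('category').cat.codes

    print(df2.groupby(['id', 'team']).sum())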

    An additional, but possibly important, point for memory management: you can often save substantial memory by using categoricals rather than string objects (perhaps you are already doing this, though):

    df['cat2'] = df['cat'].astype('category')
    
    df[['cat','cat2']].memory_usage()
    
    cat     48
    cat2    30
    

    Not much saving here for the small example dataframe, but it could make a substantial difference in your actual dataframe.
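
    To get a feel for the difference at the scale of your data (about a million rows with only a few distinct labels), a rough comparison along these lines should show the object column taking many times more memory than the categorical one (exact numbers will vary with pandas version and platform):

    import numpy as np
    import pandas as pd

    s = pd.Series(np.random.choice(['cat1', 'cat2', 'cat3'], size=10**6))
    print(s.memory_usage(deep=True))                      # object strings: one Python string object per row
    print(s.astype('category').memory_usage(deep=True))   # small integer codes plus a 3-entry lookup table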