Pandas v0.25 groupby with many columns gives memory error


After updating to pandas v0.25.2, a script that does a groupby over many columns on a large dataframe no longer works. I get a memory error:

MemoryError: Unable to allocate array with shape (some huge number...,) and data type int64

Doing a bit of research, I found an issue (#14942) reported on GitHub for an earlier version, with this example:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cat': np.random.randint(0, 255, size=3000000),
    'int_id': np.random.randint(0, 255, size=3000000),
    'other_id': np.random.randint(0, 10000, size=3000000),
    'foo': 0
})
df['cat'] = df['cat'].astype(str).astype('category')

# killed after 6 minutes of 100% CPU and 90G maximum main memory usage
grouped = df.groupby(['cat', 'int_id', 'other_id']).count()

Running this code on version 0.25.2 also gives a memory error. Am I doing something wrong (has the syntax changed in pandas v0.25?), or has this issue, which is marked as resolved, returned?


Solution

  • Use observed=True to fix it. This prevents groupby from expanding the result to all possible combinations of the factor (categorical) variables:

    df.groupby(index, observed=True)
    

    There is a related GitHub issue, "PERF: groupby with many empty groups memory blowup". The issue has been closed, and the default will change from observed=False to observed=True in a future version of pandas.
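
To make the difference concrete, here is a minimal sketch on a toy frame. The data and the names keys, observed_only and expanded are made up for illustration and are not from the question. It only compares the number of result rows with and without observed=True.

```python
import pandas as pd

# Toy frame (hypothetical data): three rows, each with a distinct key pair.
df = pd.DataFrame({
    "cat": ["a", "b", "c"],
    "int_id": [1, 2, 3],
    "foo": 0,
})
df["cat"] = df["cat"].astype("category")

keys = ["cat", "int_id"]

# observed=True: one output row per key combination present in the data.
observed_only = df.groupby(keys, observed=True).count()

# observed=False: the result also includes combinations that never occur.
expanded = df.groupby(keys, observed=False).count()

print(len(observed_only), len(expanded))
```

On this toy frame only 3 of the 3 × 3 = 9 possible key pairs occur, so the second result carries six empty groups. In the question's frame the keys have up to 255, 255 and 10000 distinct values against only 3,000,000 rows, so the gap between observed and possible combinations is far larger.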