Preserving DataFrame subclass type during pandas groupby().aggregate()


I'm subclassing pandas DataFrame in a project of mine. Most pandas operations preserve the subclass type, but df.groupby().agg() does not. Is this a bug? Is there a known workaround?

import pandas as pd

class MySeries(pd.Series):
    pass

class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame
    _constructor_sliced = MySeries

MySeries._constructor_expanddim = MyDataFrame

df = MyDataFrame({"a": reversed(range(10)), "b": list('aaaabbbccc')})

print(type(df.groupby("b").sum()))
# <class '__main__.MyDataFrame'>

print(type(df.groupby("b").agg({"a": "sum"})))
# <class 'pandas.core.frame.DataFrame'>

It looks like an earlier issue (described here) led to a fix for subclass preservation in df.groupby, but as far as I can tell df.groupby().agg() was missed. I'm using pandas version 2.0.3.


Solution

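  • This is not a bug. groupby().agg() builds its result by combining Series into a DataFrame, so the Series subclass needs its own constructors defined too. In the question, MySeries does not define _constructor, so results derived from it fall back to plain pandas types. Defining both _constructor and _constructor_expanddim on MySeries, as properties, fixes this. See this documentation.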

    The following code runs with no errors:

    import pandas as pd
    
    class MySeries(pd.Series):
        @property
        def _constructor(self):
            return MySeries
    
        @property
        def _constructor_expanddim(self):
            return MyDataFrame
    
    class MyDataFrame(pd.DataFrame):
        @property
        def _constructor(self):
            return MyDataFrame
    
        @property
        def _constructor_sliced(self):
            return MySeries
    
    
    df = MyDataFrame({"a": reversed(range(10)), "b": list('aaaabbbccc')})
    
    assert isinstance(df.groupby("b").agg({"a": "sum"}), MyDataFrame)
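
To see which constructor hook each kind of operation goes through, here is a minimal sketch that reuses the classes from the answer. BareSeries is a hypothetical class added only for contrast (a Series subclass with no overrides, like MySeries in the question), and the small three-row frame is made-up sample data.

```python
import pandas as pd


class BareSeries(pd.Series):
    """Hypothetical Series subclass with no constructor overrides."""


class MySeries(pd.Series):
    @property
    def _constructor(self):
        # Used when an operation returns another 1-D result.
        return MySeries

    @property
    def _constructor_expanddim(self):
        # Used when a Series is expanded into a 2-D result.
        return MyDataFrame


class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        # Used when an operation returns another 2-D result.
        return MyDataFrame

    @property
    def _constructor_sliced(self):
        # Used when a single column or row is pulled out.
        return MySeries


df = MyDataFrame({"a": [3, 2, 1], "b": list("xxy")})

column = df["a"]               # DataFrame -> Series: _constructor_sliced
shifted = column + 1           # Series -> Series: MySeries._constructor
framed = shifted.to_frame()    # Series -> DataFrame: _constructor_expanddim

# Without a _constructor override, the result is a plain pandas Series.
bare_result = BareSeries([1, 2, 3]) + 1

for name, obj in [("column", column), ("shifted", shifted),
                  ("framed", framed), ("bare_result", bare_result)]:
    print(name, type(obj).__name__)
```

If any of the three subclass results prints a plain pandas type, the corresponding hook is the one to check first. The bare_result line shows the fallback that the question's MySeries was hitting.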