Tags: python, pandas, dataframe

Deprecation Warning with groupby.apply


I have a Python script that reads in data from a CSV file.

The code runs fine, but every time it runs I get this deprecation message:

DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.

The warning stems from this piece of code:

fprice = (
    df.groupby(['StartDate', 'Commodity', 'DealType'])
      .apply(lambda group: -(group['MTMValue'].sum()
                             - (group['FixedPriceStrike'] * group['Quantity']).sum())
             / group['Quantity'].sum())
      .reset_index(name='FloatPrice')
)

To my understanding, I am performing the apply function on my groupings, but then I am disregarding the groupings and not using them any more as part of my dataframe. I am confused by the directions for silencing the warning.

here is some sample data that this code uses:

TradeID  TradeDate   Commodity    StartDate   ExpiryDate  FixedPrice  Quantity  MTMValue
-------  ----------  -----------  ----------  ----------  ----------  --------  --------
aaa      01/01/2024  (com1,com2)  01/01/2024  01/01/2024          10        10    100.00
bbb      01/01/2024  (com1,com2)  01/01/2024  01/01/2024          10        10    100.00
ccc      01/01/2024  (com1,com2)  01/01/2024  01/01/2024          10        10    100.00

and here is the expected output from this data:

TradeID  TradeDate   Commodity    StartDate   ExpiryDate  FixedPrice  Quantity  MTMValue  FloatPrice
-------  ----------  -----------  ----------  ----------  ----------  --------  --------  ----------
aaa      01/01/2024  (com1,com2)  01/01/2024  01/01/2024          10        10    100.00           0
bbb      01/01/2024  (com1,com2)  01/01/2024  01/01/2024          10        10    100.00           0
ccc      01/01/2024  (com1,com2)  01/01/2024  01/01/2024          10        10    100.00           0

Solution

  • About include_groups parameter

    The include_groups parameter of DataFrameGroupBy.apply is new in pandas 2.2.0. It is essentially a transition-period (2.2.0 -> 3.0) parameter, added to communicate a behavior change (via warnings) and to address pandas Issue 7155. In most cases you can simply set it to False to silence the warning (see below).

    Setup

    Let's say you have a pandas DataFrame df and a dummy function myfunc for apply, and you want to

    • Group by column 'c'
    • Apply myfunc on each group
    >>> df
          a  value     c
    0   foo     10  cat1
    1   bar     20  cat2
    2   baz     30  cat1
    3  quux     40  cat2
    
    
    >>> def myfunc(x):
    ...     print(x, '\n')
    

    include_groups = True (Old behavior)

    • This is the default behavior in pandas <2.2.0 (where there is no include_groups parameter)
    • pandas 2.2.0 and above (likely until 3.0) still defaults to this, but issues a DeprecationWarning.
    • The grouping column(s), here 'c', are included in each group passed to apply:
    >>> df.groupby('c').apply(myfunc)
         a  value     c
    0  foo     10  cat1
    2  baz     30  cat1 
    
          a  value     c
    1   bar     20  cat2
    3  quux     40  cat2 
    

    Now, as mentioned in Issue 7155, keeping the grouping column c in the dataframe passed to apply is unwanted behavior: most people will not expect c to be present there. The answer by bue actually has an example of how this can lead to bugs: applying np.mean and expecting fewer columns (which causes a bug if your grouping column is numerical).

    include_groups = False (New behavior)

    • This removes the warning in pandas >= 2.2.0 (<3.0)
    • This will be the default in a future version of pandas (likely 3.0)
    • This is likely what you want: drop the grouping column 'c':
    >>> df.groupby('c').apply(myfunc, include_groups=False)
         a  value
    0  foo     10
    2  baz     30 
    
          a  value
    1   bar     20
    3  quux     40 
    

    Circumventing the need to use include_groups at all

    Option 1: Explicitly giving column names

    You may also avoid the include_groups parameter entirely by explicitly selecting the list of columns (as pointed out by the warning itself: "...or explicitly select the grouping columns after groupby to silence this warning...", and by Cahit in their answer), like this:

    >>> df.groupby('c')[['a', 'value', 'c']].apply(myfunc)
         a  value     c
    0  foo     10  cat1
    2  baz     30  cat1 
    
          a  value     c
    1   bar     20  cat2
    3  quux     40  cat2 
    
    Empty DataFrame
    Columns: []
    Index: []
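    If hardcoding ['a', 'value', 'c'] feels brittle (it silently goes stale when columns are added), the same explicit selection can be built from df.columns instead; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar', 'baz', 'quux'],
                   'value': [10, 20, 30, 40],
                   'c': ['cat1', 'cat2', 'cat1', 'cat2']})

# Explicitly selecting every column (grouping key included) keeps the
# old behavior, without relying on a hardcoded column list.
cols = list(df.columns)   # ['a', 'value', 'c']
result = df.groupby('c')[cols].apply(lambda g: g['value'].sum())
print(result)   # cat1 -> 40, cat2 -> 60
```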
    

    Option 2: Setting the index before groupby

    You may also set the grouping column as the index before the groupby, as pointed out by Stefan in the comments.

    >>> df.set_index('c').groupby(level='c').apply(myfunc)
            a  value
    c               
    cat1  foo     10
    cat1  baz     30 
    
             a  value
    c                
    cat2   bar     20
    cat2  quux     40 
    
    Empty DataFrame
    Columns: []
    Index: []
    
    

    Details just for this use case

    Your grouping columns are

    ['StartDate', 'Commodity', 'DealType']
    

    In the apply function you use the following columns:

    ['MTMValue',  'FixedPriceStrike', 'Quantity']
    

    i.e., you do not need any of the grouping columns inside your apply, and therefore you can use include_groups=False, which also removes the warning.

    fprice = (
        df.groupby(['StartDate', 'Commodity', 'DealType'])
          .apply(lambda group: -(group['MTMValue'].sum()
                                 - (group['FixedPriceStrike'] * group['Quantity']).sum())
                 / group['Quantity'].sum(),
                 include_groups=False)
          .reset_index(name='FloatPrice')
    )