Search code examples
python-3.xpandasgroup-by

Pandas groupby and then add new row for each group


I have a pandas DataFrame that I want to group by the 'id' column then add an extra row to the bottom of each group where the date for that row is one business day forward from the last row of the group.

df = pd.DataFrame(data={'d': [datetime.date(2010,12,30), datetime.date(2010,12,31), datetime.date(2010,12,30),datetime.date(2010,12,31)], 'id': [1,1,2,2], 'val': [10,200, 90,420]})

I have:

        Date      id    val
0    2010-12-30    1    10
1    2010-12-31    1    200
2    2010-12-30    2    90
3    2010-12-31    2    420

I want:

     Date         id    val
0    2010-12-30    1    10
1    2010-12-31    1    200
2    2011-01-01    1    NaN 
3    2010-12-30    2    90
4    2010-12-31    2    420
5    2011-01-01    2    Nan

The solution shown here from zero seems like it should work:

Pandas: add row to each group depending on condition

I have tried adapting it to my situation but just can't make it work:

def add_row(x):
    from pandas.tseries.offsets import BDay

    last_row = x.iloc[-1]
    last_row['Date'] = x.Date + BDay(1)
    
    return x.append(last_row)   

df.groupby('id').apply(add_row)

AttributeError: 'DataFrame' object has no attribute 'Date'

I don't just want to resolve this particular error message I want a solution to the problem.


Solution

  • I would use:

    df['d'] = pd.to_datetime(df['d'])
    
    out = pd.concat([df,
                     (df.loc[df.groupby('id')['d'].idxmax(), ['d', 'id']]
                        .assign(d=lambda x: x['d'].add(pd.DateOffset(days=1)))
                     )
                    ]
                    ).sort_index(kind='stable', ignore_index=True)
    

    NB. for business days replace pd.DateOffset(days=1) by pd.offsets.BusinessDay(1).

    Output:

               d  id    val
    0 2010-12-30   1   10.0
    1 2010-12-31   1  200.0
    2 2011-01-01   1    NaN
    3 2010-12-30   2   90.0
    4 2010-12-31   2  420.0
    5 2011-01-01   2    NaN
    

    Alternative:

    last = df.groupby('id')['d'].idxmax()
    
    out = df.loc[df.index.repeat(df.index.isin(last)+1)]
    
    m = out.index.duplicated()
    
    out.loc[m, 'd'] += pd.DateOffset(days=1)
    out.loc[m, 'val'] = float('nan')