Search code examples
pandasgroup-byaggregate-functions

Filling missing dates in pandas groupby with aggregate functions


I have a dataframe where each id may have gaps. I'd like to fill in missing dates (with NaN values) within each group before using aggregate functions as they don't seem to go by date, but I'm getting the error below.

import pandas as pd

df = pd.DataFrame({'date': ['2014-08-01',
                            '2014-08-02', 
                            '2014-08-03',
                            '2014-08-04',
                            '2014-08-10',
                            '2014-08-15',
                            '2014-08-12',
                            '2014-08-15',
                            '2014-08-20',
                            '2014-08-21',
                            '2014-08-22',                            
                            '2014-08-30',                            
                           ], 
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], 

                   'id': ['a', 'a', 'a', 'a', 'a', 'a',  'b', 'b', 'b', 'b', 'b', 'b']})

df['date'] = pd.to_datetime(df['date'])

df.sort_values('date', ascending=True).groupby('id').set_index('date').resample('1D').rolling(2).mean()

AttributeError: 'DataFrameGroupBy' object has no attribute 'set_index'


Solution

  • IIUC use Series.asfreq for add consecutive datetimes:

    df['date'] = pd.to_datetime(df['date'])
    
    df1 = (df.sort_values('date', ascending=True)
            .set_index('date')
            .groupby('id')
            .apply(lambda x: x.asfreq('d').rolling(2).mean())
    

    print (df1)
    
                   value
    id date             
    a  2014-08-01    NaN
       2014-08-02    1.5
       2014-08-03    2.5
       2014-08-04    3.5
       2014-08-05    NaN
       2014-08-06    NaN
       2014-08-07    NaN
       2014-08-08    NaN
       2014-08-09    NaN
       2014-08-10    NaN
       2014-08-11    NaN
       2014-08-12    NaN
       2014-08-13    NaN
       2014-08-14    NaN
       2014-08-15    NaN
    b  2014-08-12    NaN
       2014-08-13    NaN
       2014-08-14    NaN
       2014-08-15    NaN
       2014-08-16    NaN
       2014-08-17    NaN
       2014-08-18    NaN
       2014-08-19    NaN
       2014-08-20    NaN
       2014-08-21    9.5
       2014-08-22   10.5
       2014-08-23    NaN
       2014-08-24    NaN
       2014-08-25    NaN
       2014-08-26    NaN
       2014-08-27    NaN
       2014-08-28    NaN
       2014-08-29    NaN
       2014-08-30    NaN
    

    df['date'] = pd.to_datetime(df['date'])
    
    df2 = (df.sort_values('date', ascending=True)
            .set_index('date')
            .groupby('id')['value']
            .apply(lambda x: x.asfreq('d').rolling(2, min_periods=1).mean()))
    

    print (df2)
    id  date      
    a   2014-08-01     1.0
        2014-08-02     1.5
        2014-08-03     2.5
        2014-08-04     3.5
        2014-08-05     4.0
        2014-08-06     NaN
        2014-08-07     NaN
        2014-08-08     NaN
        2014-08-09     NaN
        2014-08-10     5.0
        2014-08-11     5.0
        2014-08-12     NaN
        2014-08-13     NaN
        2014-08-14     NaN
        2014-08-15     6.0
    b   2014-08-12     7.0
        2014-08-13     7.0
        2014-08-14     NaN
        2014-08-15     8.0
        2014-08-16     8.0
        2014-08-17     NaN
        2014-08-18     NaN
        2014-08-19     NaN
        2014-08-20     9.0
        2014-08-21     9.5
        2014-08-22    10.5
        2014-08-23    11.0
        2014-08-24     NaN
        2014-08-25     NaN
        2014-08-26     NaN
        2014-08-27     NaN
        2014-08-28     NaN
        2014-08-29     NaN
        2014-08-30    12.0
    Name: value, dtype: float64
    

    If need forward filling missing values:

    df['date'] = pd.to_datetime(df['date'])
    
    s = (df.sort_values('date', ascending=True)
            .set_index('date')
            .groupby('id')['value']
            .apply(lambda x: x.asfreq('d', method='ffill')))
    

    print (s)
    id  date      
    a   2014-08-01     1
        2014-08-02     2
        2014-08-03     3
        2014-08-04     4
        2014-08-05     4
        2014-08-06     4
        2014-08-07     4
        2014-08-08     4
        2014-08-09     4
        2014-08-10     5
        2014-08-11     5
        2014-08-12     5
        2014-08-13     5
        2014-08-14     5
        2014-08-15     6
    b   2014-08-12     7
        2014-08-13     7
        2014-08-14     7
        2014-08-15     8
        2014-08-16     8
        2014-08-17     8
        2014-08-18     8
        2014-08-19     8
        2014-08-20     9
        2014-08-21    10
        2014-08-22    11
        2014-08-23    11
        2014-08-24    11
        2014-08-25    11
        2014-08-26    11
        2014-08-27    11
        2014-08-28    11
        2014-08-29    11
        2014-08-30    12
    Name: value, dtype: int64