python, pandas, multi-index

Pandas Reindexing MultiIndex Relative to Arbitrary Level


I'm trying to reindex a dataframe relative to the second level of an index. I have a dataframe where the first level of the index is user id and the second level is date. For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': 3*['A'] + 5*['B'] + 4*['C'],
    'date': ['01-01-2010', '02-01-2010', '12-01-2010',
             '04-01-2015', '05-01-2015', '03-01-2016', '04-01-2016', '05-01-2016',
             '01-01-2015', '02-01-2015', '03-01-2015', '04-01-2015'],
    'value': np.random.randint(10, 100, 12)
}).set_index(['id', 'date'])

I want to reindex the dates to fill in the missing ones, but only between the min and max dates of each "id" group.

For example, user "A" should have continuous monthly data from January through December 2010, and user "B" should have continuous monthly data from April 2015 through May 2016. For simplicity, let's assume I want to fill the NaNs with zeros.

Similar questions assume the same date_range is used for all users, which doesn't work in this use case. Any ideas?


Solution

  • I think you need reset_index + groupby + resample + asfreq + fillna:

    import numpy as np
    import pandas as pd

    np.random.seed(123)
    df = pd.DataFrame({
    'id': 3*['A'] + 5*['B'] + 4*['C'],
    'date': ['01-01-2010', '02-01-2010', '12-01-2010',
             '04-01-2015', '05-01-2015', '03-01-2016', '04-01-2016', '05-01-2016',
             '01-01-2015', '02-01-2015', '03-01-2015', '04-01-2015'],
    'value': np.random.randint(10,100, 12)})
    
    df['date'] = pd.to_datetime(df['date'])
    df = df.set_index(['id', 'date'])
    print (df)
                   value
    id date             
    A  2010-01-01     76
       2010-02-01     27
       2010-12-01     93
    B  2015-04-01     67
       2015-05-01     96
       2016-03-01     57
       2016-04-01     83
       2016-05-01     42
    C  2015-01-01     56
       2015-02-01     35
       2015-03-01     93
       2015-04-01     88
    

    df1 = df.reset_index(level='id').groupby('id')['value'].resample('D').asfreq().fillna(0)
    print (df1.head(10))
                   value
    id date             
    A  2010-01-01   76.0
       2010-01-02    0.0
       2010-01-03    0.0
       2010-01-04    0.0
       2010-01-05    0.0
       2010-01-06    0.0
       2010-01-07    0.0
       2010-01-08    0.0
       2010-01-09    0.0
       2010-01-10    0.0
    

    But if you only need to process the min and max dates, first select those rows per group using agg with idxmin and idxmax, then loc:

    df = df.reset_index()
    df1 = df.loc[df.groupby('id')['date'].agg(['idxmin', 'idxmax']).stack()]
    print (df1)
       id       date  value
    0   A 2010-01-01     76
    2   A 2010-12-01     93
    3   B 2015-04-01     67
    7   B 2016-05-01     42
    8   C 2015-01-01     56
    11  C 2015-04-01     88
    
    df1 = df1.set_index('date').groupby('id')['value'].resample('MS').asfreq().fillna(0)
    print (df1.head(10))
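
Note that selecting only the idxmin/idxmax rows drops the observations in between (for example A's 2010-02-01 value), so those months come back as 0 after fillna. If the in-between values should be kept, one option is to reindex each group against its own monthly range. The following is only a minimal sketch of that idea, not part of the original answer: it uses reindex instead of resample, assumes the dates are month-first strings and always fall on the first of the month (so the 'MS' frequency lines up with them), and the names `parts`, `full` and `filled` are made up for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({
    'id': 3*['A'] + 5*['B'] + 4*['C'],
    'date': ['01-01-2010', '02-01-2010', '12-01-2010',
             '04-01-2015', '05-01-2015', '03-01-2016', '04-01-2016', '05-01-2016',
             '01-01-2015', '02-01-2015', '03-01-2015', '04-01-2015'],
    'value': np.random.randint(10, 100, 12)})

# Assumption: the strings are month-day-year, as in the printed output above.
df['date'] = pd.to_datetime(df['date'], format='%m-%d-%Y')
df = df.set_index(['id', 'date'])

parts = {}
for key, group in df.groupby(level='id'):
    dates = group.index.get_level_values('date')
    # Month-start range between this group's own first and last date.
    full = pd.date_range(dates.min(), dates.max(), freq='MS', name='date')
    # Drop the 'id' level, align to the full range, fill missing months with 0.
    parts[key] = group.droplevel('id').reindex(full, fill_value=0)

# Stack the per-id frames back together under an ('id', 'date') MultiIndex.
filled = pd.concat(parts, names=['id', 'date'])

print(filled.groupby(level='id').size())  # A: 12 rows, B: 14, C: 4
```

Because fill_value is passed to reindex, no NaN is ever introduced and the value column keeps its integer dtype, unlike the asfreq().fillna(0) route, which goes through float.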