Search code examples
pythonpandasmulti-index

Is there a way to apply a function to a MultiIndex dataframe slice with the same outer index without iterating each slice?


Basically, what I'm trying to accomplish is to fill the missing dates (creating new DataFrame rows) with respect to each product, then create a new column based on a cumulative sum of column 'A' (example shown below)

The data is a MultiIndex with (product, date) as indexes.

Basically I would like to apply this answer to a MultiIndex DataFrame using only the rightmost index and calculating a subsequent np.cumsum for each product (and all dates).

                    A
product    date 
0       2017-01-02  1
        2017-01-03  2
        2017-01-04  2
        2017-01-05  1
        2017-01-06  4
        2017-01-07  1
        2017-01-10  7
1       2018-06-29  1
        2018-06-30  4
        2018-07-01  1
        2018-07-02  1
        2018-07-04  2

What I want to accomplish (efficiently) is:

                    A      CumSum
product    date 
0       2017-01-02  1        1
        2017-01-03  2        3
        2017-01-04  2        5
        2017-01-05  1        6
        2017-01-06  4        10
        2017-01-07  1        11
        2017-01-08  0        11
        2017-01-09  0        11
        2017-01-10  7        18
1       2018-06-29  1        1
        2018-06-30  4        5
        2018-07-01  1        6
        2018-07-02  1        7
        2018-07-03  0        7
        2018-07-04  2        9

Solution

  • You have 2 ways:

    One way:
    Using groupby with apply and with resample and cumsum. Finally, pd.concat result with df.A and fillna with 0

    s = (df.reset_index(0).groupby('product').apply(lambda x: x.resample(rule='D')
                                                               .asfreq(0).A.cumsum()))
    pd.concat([df.A, s.rename('cumsum')], axis=1).fillna(0)
    
    Out[337]:
                          A  cumsum
    product date
    0       2017-01-02  1.0       1
            2017-01-03  2.0       3
            2017-01-04  2.0       5
            2017-01-05  1.0       6
            2017-01-06  4.0      10
            2017-01-07  1.0      11
            2017-01-08  0.0      11
            2017-01-09  0.0      11
            2017-01-10  7.0      18
    1       2018-06-29  1.0       1
            2018-06-30  4.0       5
            2018-07-01  1.0       6
            2018-07-02  1.0       7
            2018-07-03  0.0       7
            2018-07-04  2.0       9
    

    Another way:
    you need 2 groupbys. First one for resample, 2nd one for cumsum. Finally, use pd.concat and fillna with 0

    s1 = df.reset_index(0).groupby('product').resample(rule='D').asfreq(0).A
    pd.concat([df.A, s1.groupby(level=0).cumsum().rename('cumsum')], axis=1).fillna(0)
    
    Out[351]:
                          A  cumsum
    product date
    0       2017-01-02  1.0       1
            2017-01-03  2.0       3
            2017-01-04  2.0       5
            2017-01-05  1.0       6
            2017-01-06  4.0      10
            2017-01-07  1.0      11
            2017-01-08  0.0      11
            2017-01-09  0.0      11
            2017-01-10  7.0      18
    1       2018-06-29  1.0       1
            2018-06-30  4.0       5
            2018-07-01  1.0       6
            2018-07-02  1.0       7
            2018-07-03  0.0       7
            2018-07-04  2.0       9