python pandas dataframe time-series datetimeindex

Pandas: combine resampling and group by. How do I average values within the same group before summing them up in a time bin?


I have tabular data (columns Col1, Col2, ...) where each row is associated with a timestamp and a custom-made group. Example:

           Group  Col1  Col2  Col3
Time
2020-05-18     A    10    20    30
2020-05-18     B    10    20    30
2020-05-18     C    10    20    30
2020-05-19     A    10    20    30
2020-05-19     B    10    20    30
2020-05-19     C    10    20    30
2020-05-20     A    10    20    30
2020-05-20     B    10    20    30
2020-05-20     C    10    20    30
2020-05-21     A    10    20    30
2020-05-21     B    10    20    30
2020-05-21     C    10    20    30

This dataframe can be generated by:

import pandas as pd

dataframe = pd.DataFrame({"Time": ["2020-05-18", "2020-05-18", "2020-05-18", "2020-05-19", "2020-05-19", "2020-05-19", "2020-05-20", "2020-05-20", "2020-05-20", "2020-05-21", "2020-05-21", "2020-05-21"], "Group": ["A","B","C","A","B","C","A","B","C","A","B","C"], "Col1": 12*[10], "Col2": 12*[20], "Col3": 12*[30]})
dataframe["Time"] = pd.to_datetime(dataframe["Time"])
dataframe = dataframe.set_index("Time")

I would like to create time bins (as the resample function does), average the values WITHIN each group, and THEN sum those averages within each time bin. If I write dataframe.resample("2D").sum(), the values are summed directly over all groups, without first being averaged within each group. Like this:

            Col1  Col2  Col3
Time
2020-05-18    60   120   180
2020-05-20    60   120   180

But what I need is to average values belonging to the same group BEFORE summing them up. In this case the desired output would be:

            Col1  Col2  Col3
Time
2020-05-18    30    60    90
2020-05-20    30    60    90

I tried to apply groupby after resampling (e.g. dataframe.resample("2D").groupby("Group")), but it raises a TypeError ("'TimeGrouper' object is not callable").

What could be a way to solve this problem? Thanks in advance.


EDIT:

Output of dataframe.groupby(["Time","Group"]).mean() :

                  Col1  Col2  Col3
Time       Group
2020-05-18 A        10    20    30
           B        10    20    30
           C        10    20    30
2020-05-19 A        10    20    30
           B        10    20    30
           C        10    20    30
2020-05-20 A        10    20    30
           B        10    20    30
           C        10    20    30
2020-05-21 A        10    20    30
           B        10    20    30
           C        10    20    30

Solution

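  • Use Grouper with aggregate mean. Time is the index here, so pass it to pd.Grouper through level= (key= expects a column with that name):

    print (dataframe.groupby(['Group', pd.Grouper(freq='2D', level='Time')]).mean())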
                      Col1  Col2  Col3
    Group Time                        
    A     2020-05-18    10    20    30
          2020-05-20    10    20    30
    B     2020-05-18    10    20    30
          2020-05-20    10    20    30
    C     2020-05-18    10    20    30
          2020-05-20    10    20    30
    

    And then sum by second level Time:

    df = (
        dataframe
        .groupby(['Group', pd.Grouper(freq='2D', level='Time')])
        .mean()
        .groupby(level='Time')  # sum(level=1) was removed in pandas 2.0
        .sum()
    )
    
    print (df)
                Col1  Col2  Col3
    Time                        
    2020-05-18    30    60    90
    2020-05-20    30    60    90
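
For anyone who wants to run the whole thing end to end, here is a minimal self-contained sketch of the two steps above, assuming a recent pandas and that Time is the DatetimeIndex, as in the question. The helper name mean_then_sum and its parameters are made up for illustration and are not part of pandas. The direct resample sum is included only for comparison.

```python
import pandas as pd

# Rebuild the frame from the question; "Time" ends up as the DatetimeIndex.
dataframe = pd.DataFrame({
    "Time": pd.to_datetime(
        ["2020-05-18"] * 3 + ["2020-05-19"] * 3
        + ["2020-05-20"] * 3 + ["2020-05-21"] * 3
    ),
    "Group": ["A", "B", "C"] * 4,
    "Col1": 12 * [10],
    "Col2": 12 * [20],
    "Col3": 12 * [30],
}).set_index("Time")


def mean_then_sum(df, freq="2D", group_col="Group", time_level="Time"):
    """Hypothetical helper: mean per (group, time bin), then sum of those means per bin."""
    # Step 1: one row per (group, time bin), holding the within-group average.
    per_group = df.groupby([group_col, pd.Grouper(freq=freq, level=time_level)]).mean()
    # Step 2: collapse the group level by summing the averages inside each time bin.
    return per_group.groupby(level=time_level).sum()


result = mean_then_sum(dataframe)

# For comparison: summing directly mixes all groups and skips the averaging step.
plain_sum = dataframe[["Col1", "Col2", "Col3"]].resample("2D").sum()

print(result)
print(plain_sum)
```

If Time were a regular column instead of the index, the same sketch should work with pd.Grouper(freq=freq, key="Time") in step 1; the second groupby stays on the Time level either way.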