Search code examples
pythonpandassequential

Is there a way to get multiple averages after groupby function in pandas?


I have the following pandas timeseries dataframe:

Index   Time    Centre position X   Centre position Y   Datafile    Group   Zone    Timeframe   dV
8789    1257.318    180.0   201.0   CHR1    CHR Zone A  Before stimulation  15.625000
8790    1257.462    181.0   195.0   CHR1    CHR Zone A  Before stimulation  42.241406
8791    1257.590    184.0   188.0   CHR1    CHR Zone A  Before stimulation  59.498227
8792    1257.718    187.0   184.0   CHR1    CHR Zone B  Before stimulation  39.062500
8793    1257.862    190.0   176.0   CHR1    CHR Zone B  Before stimulation  59.333359
8794    1257.927    190.0   173.0   CHR1    CHR Zone A  Before stimulation  46.153846
8795    1258.054    192.0   171.0   CHR1    CHR Zone A  Before stimulation  22.271080
8796    1258.198    192.0   172.0   CHR1    CHR Zone C  After stimulation   6.944444
8797    1258.326    192.0   171.0   CHR1    CHR Zone C  After stimulation   7.812500
8798    1258.454    191.0   169.0   CHR1    CHR Zone A  After stimulation   17.469281
8799    1258.598    191.0   168.0   CHR1    CHR Zone A  After stimulation   6.944444
8800    1258.726    192.0   165.0   CHR1    CHR Zone A  After stimulation   24.705294

I would like to extract the average speed (dV) grouped by Timeframe and Zone, however, since the data is sequential, I would like to get multiple averages for a given Timeframe and Zone. I couldn't figure out an elegant way to achieve this since groupby averages all the values and outputs a single value.

Expected output is: Expected output

Thank you very much in advance!


Solution

  • The first thing you need do is to create a reference column. A very naive way is like

    df.loc[:,'Zone_shift']=df.loc[:,'Zone'].shift(1)
    df.loc[:,'Timeframe_shift']=df.loc[:,'Timeframe'].shift(1)
    df.loc[:,'Groupby'] = df.apply(lambda x: 0 if x['Zone']==x['Zone_shift'] and x['Timeframe']==x['Timeframe_shift'] else 1, axis=1)
    df.loc[:,'Groupby'] = df.loc[:,'Groupby'].cumsum()
    

    After adding the reference data, the dataframe is look like

        Zone    Timeframe               dV  Zone_shift  Timeframe_shift Groupby
    0   ZoneA   Beforestimulation   15.625    nan        nan            1
    1   ZoneA   Beforestimulation   42.241  ZoneA   Beforestimulation   1
    2   ZoneA   Beforestimulation   59.498  ZoneA   Beforestimulation   1
    3   ZoneB   Beforestimulation   39.062  ZoneA   Beforestimulation   2
    4   ZoneB   Beforestimulation   59.333  ZoneB   Beforestimulation   2
    5   ZoneA   Beforestimulation   46.153  ZoneB   Beforestimulation   3
    6   ZoneA   Beforestimulation   22.271  ZoneA   Beforestimulation   3
    7   ZoneC   Afterstimulation    6.9444  ZoneA   Beforestimulation   4
    8   ZoneC   Afterstimulation    7.8125  ZoneC   Afterstimulation    4
    9   ZoneA   Afterstimulation    17.469  ZoneC   Afterstimulation    5
    10  ZoneA   Afterstimulation    6.9444  ZoneA   Afterstimulation    5
    11  ZoneA   Afterstimulation    24.705  ZoneA   Afterstimulation    5
    

    Then you only need to groupby by

    df.groupby(['Groupby','Zone','Timeframe']).mean()
    

    And the final output would be like

    Groupby Zone    Timeframe   dV
    1   ZoneA   Beforestimulation   39.12154433333333
    2   ZoneB   Beforestimulation   49.1979295
    3   ZoneA   Beforestimulation   34.212463
    4   ZoneC   Afterstimulation    7.378472
    5   ZoneA   Afterstimulation    16.373006333333333