Search code examples
pythonpandastime-seriespandas-groupbypandas-resample

Mean Aggregations using pandas GroupBy and Time Series resampling


I'm having trouble with Pandas groupby functionality and Time Series. I've read the documentation, but I can't see to figure out how to apply aggregate functions to multiple columns and calculate the mean of the volume (average) of the „aggregate “ correctly.

This is my code for import the CSV File:

#CSV Import
import pandas as pd
path = r'Z:\Python\30_Min_Data.txt'

from datetime import datetime
customdateparse = lambda x: datetime.strptime(x, '%Y/%m/%d %H:%M:%S.%f')
df = pd.read_csv(
        path,
        parse_dates={'DateTime': [0, 1]},
        date_parser=customdateparse)

# Set the Date as the Index --> needed for Resampling
df.set_index('DateTime', inplace=True)
df.sort_index()     

This is my DataFrame after the Import:

df
Out[3]: 
                     Volume Session
DateTime                           
2020-12-16 08:00:00    1000    PRTH
2020-12-16 08:30:00    5000    PRTH
2020-12-16 09:00:00    1000     RTH
2020-12-16 09:30:00    3000     RTH
2020-12-17 08:00:00    2000    PRTH
2020-12-17 08:30:00    2000    PRTH
2020-12-17 09:00:00    2000     RTH
2020-12-17 09:30:00    2000     RTH
2020-12-18 08:00:00    1000    PRTH
2020-12-18 08:30:00    1000    PRTH
2020-12-18 09:00:00    1000     RTH
2020-12-18 09:30:00    1000     RTH
2019-11-18 08:00:00    1000    PRTH
2019-11-18 08:30:00    1000    PRTH
2019-11-18 09:00:00    1000     RTH
2019-11-18 09:30:00    1000     RTH

This is what I tried: It calculates the averages per day, because of the Time Series Resampling. I would expect that it sums up the values first and at the end calculates the means. But it does the mean on the whole bunch of data per day.

#2.Volume: Average per Year & Session & Day
funcs_year    = lambda idx: idx.year
(df
   .groupby([funcs_year,'Session', pd.Grouper(freq='D')])
    ['Volume']
   .mean()
)

Out[6]: 
      Session   DateTime  
2019   PRTH     2019-11-18    1000
       RTH      2019-11-18    1000
2020   PRTH     2020-12-16    3000
                2020-12-17    2000
                2020-12-18    1000
       RTH      2020-12-16    2000
                2020-12-17    2000
                2020-12-18    1000
Name: Volume, dtype: int64

This how I want the result to be correctly calculated and displayed (i calculated it manually): Average (mean) volume traded per day (shown separately for the year and the session):

Year    Session     Mean Volume
2020    RTH         3.333,33
        PRTH        4.000,00
2019    RTH         2.000,00
        PRTH        2.000,00

Anyone know what I'm missing / doing wrong?


Solution

  • Following should also work, based on your question the 'sum' displays 'sum of Volume' based on 'Year' and 'mean' displays 'mean of volums' based on 'Daily mean' both being grouped by 'Session' and 'DateTime'. (Just used some groupy chaining with joins)

    import pandas as pd
    
    data = { 
    'DateTime':['2020-12-16 08:00:00','2020-12-16 08:30:00','2020-12-16 09:00:00','2020-12-16 09:30:00','2020-12-17 08:00:00','2020-12-17 08:30:00','2020-12-17 09:00:00','2020-12-17 09:30:00','2020-12-18 08:00:00','2020-12-18 08:30:00','2020-12-18 09:00:00','2020-12-18 09:30:00','2019-11-18 08:00:00','2019-11-18 08:30:00','2019-11-18 09:00:00','2019-11-18 09:30:00'],
    'Volume':[1000,500,1000,3000,2000,2000,2000,2000,1000,1000,1000,1000,1000,1000,1000,1000],
    'Session':['PRTH','PRTH','RTH','RTH','PRTH','PRTH','RTH','RTH','PRTH','PRTH','RTH','RTH','PRTH','PRTH','RTH','RTH']
    }
    
    df = pd.DataFrame(data)
    df['DateTime'] = pd.to_datetime(df['DateTime'])
    df.index = pd.to_datetime(df['DateTime'])
    
    
    #See below code 
    x = df.groupby([df.index.strftime('%Y'),'Session',df.index.strftime('%Y-%m-%d')]).agg({'Volume':['sum','mean']}).groupby(['DateTime','Session'],level=2).agg(['sum','mean'])
    x['Volume'].drop('mean',axis=1,level=0)