Search code examples
pythonpandaslambdagroup-bycalendar

Analyse historical data: average of every hour of every day over a number of years


I have historical data from 2012 to 2023. I am trying to calculate the average for every hour in every day over these years to build a 'reference year' or 'baseline'. My DataFrame (final_df) looks like this:

                Timestamp             bruttoPower [kW]      nettoPower [kW]     
0               2012-01-01 00:00:00   1501.500000           375.375000          
1               2012-01-01 01:00:00   1488.833333           372.208333          
2               2012-01-01 02:00:00   1626.833333           406.708333          
3               2012-01-01 03:00:00   1350.333333           337.583333          
4               2012-01-01 04:00:00   1424.000000           356.000000          
...             ...                   ...                   ...                 
105187          2023-12-31 19:00:00   3301.758391           763.500000          
105188          2023-12-31 20:00:00   2322.399977           592.500000              
105189          2023-12-31 21:00:00   3393.225006           861.000000              
105190          2023-12-31 22:00:00   3784.991643           952.000000              
105191          2023-12-31 23:00:00   3661.591654           950.500000          
105192 rows × 5 columns

I had already asked a similar question: Calculate mean values of the past x years for every month

But know I am trying to get the average of each hour on every day over the years (2012-2023).

With help of the answer I got on my last post and ChatGPT I tried this:

import calendar
import pandas as pd

baseline_df = (
    final_df.groupby(final_df.pop("Timestamp").dt.date)
    .mean()
    .reset_index()
    .rename(columns={"Timestamp":"Date"})
)

baseline_df["Date"] = baseline_df["Date"].apply(lambda x: x.strftime("%Y-%m-%d"))

display(baseline_df)

which prints:

Date        brutto Energie [kWh]    netto Energie [kWh] 
2012-01-01  3440.645833             3337.426458 
2012-01-02  5163.152778             5008.258194 
2012-01-03  8374.173611             8122.948403 
...         ...                     ...
2023-12-29  6554.490294             6471.125000 
2023-12-30  4346.124824             4291.750000 

and

import calendar
import pandas as pd

final_df['Date'] = final_df['Timestamp'].dt.date
final_df['Hour'] = final_df['Timestamp'].dt.hour

baseline_df = (final_df.groupby(['Date', 'Hour']).mean().reset_index())

baseline_df["Hour"] = baseline_df["Hour"].apply(lambda x: f"{x:02}")

display(baseline_df)

which prints:

Date        Hour    Timestamp             brutto Energie [kWh]  netto Energie [kWh] 
2012-01-01  00      2012-01-01 00:00:00   1501.500000           1456.455000 
2012-01-01  01      2012-01-01 01:00:00   1488.833333           1444.168333 
2012-01-01  02      2012-01-01 02:00:00   1626.833333           1578.028333 
...         ...     ...                   ...                   ...
2023-12-31  19      2023-12-31 19:00:00   3301.758391           3252.000000
2023-12-31  20      2023-12-31 20:00:00   2322.399977           2298.000000

but it is not working as I would like. I am trying to get something like this:

Timestamp          brutto Power [kW]   netto Power [kW]    
01-01 00:00:00     1501.500000         375.375000          
01-01 01:00:00     1488.833333         372.208333          
01-01 02:00:00     1626.833333         406.708333    
01-01 03:00:00     1350.333333         1309.823333              
...                ...                 ...                 
31-12 22:00:00     3393.225006         861.000000          
31-12 23:00:00     3784.991643         952.000000          

Solution

  • IIUC you want to have only one column "Timestamp" with one hour intervals:

    t = df.pop("Timestamp")
    
    out = (
        df.groupby([t.dt.date, t.dt.hour])
        .mean()
        .rename_axis(index=["Date", "Hour"])
        .reset_index()
    )
    
    out["Timestamp"] = pd.to_datetime(
        out.pop("Date").astype(str) + " " + out.pop("Hour").astype(str) + ":00"
    )
    
    print(out)
    

    Prints:

       bruttoPower [kW]  nettoPower [kW]           Timestamp
    0       1501.500000       375.375000 2012-01-01 00:00:00
    1       1488.833333       372.208333 2012-01-01 01:00:00
    2       1626.833333       406.708333 2012-01-01 02:00:00
    3       1350.333333       337.583333 2012-01-01 03:00:00
    4       1424.000000       356.000000 2012-01-01 04:00:00
    5       3301.758391       763.500000 2023-12-31 19:00:00
    6       2322.399977       592.500000 2023-12-31 20:00:00
    7       3393.225006       861.000000 2023-12-31 21:00:00
    8       3784.991643       952.000000 2023-12-31 22:00:00
    9       3661.591654       950.500000 2023-12-31 23:00:00
    

    OR: Group by month/day/hour:

    t = df.pop("Timestamp")
    
    out = (
        df.groupby([t.dt.month, t.dt.day, t.dt.hour])
        .mean()
        .rename_axis(index=["Month", "Day", "Hour"])
        .reset_index()
    )
    
    out["Timestamp"] = (
        out.pop("Day").astype(str).str.zfill(2)
        + "-"
        + out.pop("Month").astype(str).str.zfill(2)
        + " "
        + out.pop("Hour").astype(str).str.zfill(2)
        + ":00:00"
    )
    print(out)
    

    Prints:

       bruttoPower [kW]  nettoPower [kW]       Timestamp
    0       1501.500000       375.375000  01-01 00:00:00
    1       1488.833333       372.208333  01-01 01:00:00
    2       1626.833333       406.708333  01-01 02:00:00
    3       1350.333333       337.583333  01-01 03:00:00
    4       1424.000000       356.000000  01-01 04:00:00
    5       3301.758391       763.500000  31-12 19:00:00
    6       2322.399977       592.500000  31-12 20:00:00
    7       3393.225006       861.000000  31-12 21:00:00
    8       3784.991643       952.000000  31-12 22:00:00
    9       3661.591654       950.500000  31-12 23:00:00