Search code examples
pandasdatetimesumlarge-databetween

Pandas sum values between two dates in the most efficient way?


I have a dataset which shows production reported every week and another reporting the production every hours over some subproduction. I would now like to compare the sum of all this hourly subproduction with the value reported every week in the most efficient way. How could I achieve this? I would like to avoid a for loop at all cost as my dataset is really large.

So my datasest looks like this:

Weekly reported data:

Datetime_text             | Total_Production_A
--------------------------|--------------------
2014-12-08 00:00:00.000   | 8277000
2014-12-15 00:00:00.000   | 8055000
2014-12-22 00:00:00.000   | 7774000

Hourly data:

Datetime_text             | A_Prod_1  | A_Prod_2  | A_Prod_3  | ......    | A_Prod_N  |
--------------------------|-----------|-----------|-----------|-----------|-----------|
2014-12-06 23:00:00.000   |     454   |       9   |      54   |     104   |       4   | 
2014-12-07 00:00:00.000   |       0   |     NaV   |       0   |      23   |       3   | 
2014-12-07 01:00:00.000   |      54   |       0   |       4   |     NaV   |      20   |

and so on. I would like to a new table where the differnce between the weekly reported data and hourly reported data is calculated for all dates of weekly reported data. So something like this:

Datetime_text             | Diff_Production_A
--------------------------|------------------
2014-12-08 00:00:00.000   |                10
2014-12-15 00:00:00.000   |              -100
2014-12-22 00:00:00.000   |              1350

where Diff_Production_A = Total_Production_A - sum(A_Prod_1,A_Prod_2,A_Prod_3,...,A_Prod_N;over all datetimes of a week) How can I best achieve this?

Any help is this regard would be greatly appriciated :D

Best fidu13


Solution

  • Store datetime as pd.Timestamp, then you can do all kinds of manipulation on the dates.

    For your problem, they is to group the hourly data by week (starting on Mondays), then merge it with the weekly data and calculate the differences:

    weekly["Datetime"] = pd.to_datetime(weekly["Datetime_Text"])
    
    hourly["Datetime"] = pd.to_datetime(hourly["Datetime_Text"])
    hourly["HourlyTotal"] = hourly.loc[:, "A_Prod_1":"A_Prod_N"].sum(axis=1)
    
    result = (
        hourly.groupby(pd.Grouper(key="Datetime", freq="W-MON"))["HourlyTotal"]
        .sum()
        .to_frame()
        .merge(
            weekly[["Datetime", "Total_Production_A"]],
            how="outer",
            left_index=True,
            right_on="Datetime",
        )
        .assign(Diff=lambda x: x["Total_Production_A"] - x["HourlyTotal"])
    )