Search code examples
pythonpandasdataframepandas-resample

Pandas resample behaves differently with SUM and MEAN


python==3.8.10
pandas==2.0.3 (cannot update to python >= 3.9, so I am stuck to this version)

A dataframe contains some hourly data for a year. I filter the data to remove the months from April to October, and then aggregate by DAY. That should mean 151 days are left.

Is there any reason why MEAN and SUM are calculated on a different number of rows?

import pandas as pd

df = pd.read_parquet(BASE_DIR + "hour_fw_temp2022.parquet")  # load hour data
df = df[(df.index.month < 4)|(df.index.month > 10)]   # remove April to October
df = df.resample("1D").mean()   # aggregate per day

df.describe()

returns

             temp         rh%             fw
count  151.000000  151.000000     151.000000  <=== MEAN calculated on 151 days
mean     3.725442   77.780077   96365.618102
std      4.281750   12.593718   21074.110945
min     -9.304167   25.500000   44666.666667
25%      1.254167   72.125000   82250.000000
50%      3.420833   80.125000  100083.333333
75%      6.814583   85.937500  109395.833333
max     14.091667   98.041667  166333.333333

while

df = df.resample("1D").sum()   # aggregate per day

returns

             temp          rh%            fw
count  365.000000   365.000000  3.650000e+02  <== SUM calculated on 365 days
mean    36.989041   772.260274  9.567918e+05
std     79.347405   940.838631  1.185907e+06
min   -223.300000     0.000000  0.000000e+00  <== removed rows are filled with 0
25%      0.000000     0.000000  0.000000e+00      and all the stats are wrong...
50%      0.000000     0.000000  0.000000e+00
75%     53.600000  1849.000000  2.224000e+06
max    338.200000  2353.000000  3.992000e+06

Solution

  • By resampling, you equalize the frequency of the periodic index to a specified value. Any aggregation performed afterwards is based on the same data structure and returns the same number of records. The difference is in the behavior of an aggregation function. To force sum to return Nan instead of zeros for empty data, you may need to include an additional parameter min_count=1, which is 0 by default:

    df = df.resample("1D").sum(min_count=1)
    

    See documentation for detailes