python==3.8.10
pandas==2.0.3 (cannot update to python >= 3.9, so I am stuck to this version)
A dataframe contains some hourly data for a year. I filter the data to remove the months from April to October, and then aggregate by DAY. That should mean 151 days are left.
Is there any reason why MEAN and SUM are calculated on a different number of rows?
import pandas as pd
df = pd.read_parquet(BASE_DIR + "hour_fw_temp2022.parquet") # load hour data
df = df[(df.index.month < 4)|(df.index.month > 10)] # remove April to October
df = df.resample("1D").mean() # aggregate per day
df.describe()
returns
temp rh% fw
count 151.000000 151.000000 151.000000 <=== MEAN calculated on 151 days
mean 3.725442 77.780077 96365.618102
std 4.281750 12.593718 21074.110945
min -9.304167 25.500000 44666.666667
25% 1.254167 72.125000 82250.000000
50% 3.420833 80.125000 100083.333333
75% 6.814583 85.937500 109395.833333
max 14.091667 98.041667 166333.333333
while
df = df.resample("1D").sum() # aggregate per day
returns
temp rh% fw
count 365.000000 365.000000 3.650000e+02 <== SUM calculated on 365 days
mean 36.989041 772.260274 9.567918e+05
std 79.347405 940.838631 1.185907e+06
min -223.300000 0.000000 0.000000e+00 <== removed rows are filled with 0
25% 0.000000 0.000000 0.000000e+00 and all the stats are wrong...
50% 0.000000 0.000000 0.000000e+00
75% 53.600000 1849.000000 2.224000e+06
max 338.200000 2353.000000 3.992000e+06
By resampling, you equalize the frequency of the periodic index to a specified value. Any aggregation performed afterwards is based on the same data structure and returns the same number of records. The difference is in the behavior of an aggregation function. To force sum
to return Nan
instead of zeros for empty data, you may need to include an additional parameter min_count=1
, which is 0 by default:
df = df.resample("1D").sum(min_count=1)