I have a pandas DataFrame containing a fairly large set of hourly data points. A few days have missing data (NaN). I want to fill in the missing hourly values with the mean of the values at the same hour on the previous and the following day (I've done some analysis and believe this is reasonable).
An example of the data is below:
datetime | value |
---|---|
2018-11-17 00:00:00 | 9.12 |
2018-11-17 01:00:00 | 8.94 |
2018-11-17 02:00:00 | 8.68 |
2018-11-17 03:00:00 | 8.19 |
2018-11-17 04:00:00 | 7.75 |
2018-11-17 05:00:00 | 7.35 |
2018-11-17 06:00:00 | 7.05 |
2018-11-17 07:00:00 | 6.55 |
2018-11-17 08:00:00 | 6.30 |
2018-11-17 09:00:00 | 6.28 |
2018-11-17 10:00:00 | 6.68 |
2018-11-17 11:00:00 | 7.64 |
2018-11-17 12:00:00 | 8.61 |
2018-11-17 13:00:00 | 9.44 |
2018-11-17 14:00:00 | 9.84 |
2018-11-17 15:00:00 | 9.62 |
2018-11-17 16:00:00 | 8.17 |
2018-11-17 17:00:00 | 6.16 |
2018-11-17 18:00:00 | 5.93 |
2018-11-17 19:00:00 | 5.36 |
2018-11-17 20:00:00 | 4.69 |
2018-11-17 21:00:00 | 4.36 |
2018-11-17 22:00:00 | 4.68 |
2018-11-17 23:00:00 | 4.86 |
2018-11-18 00:00:00 | NaN |
2018-11-18 01:00:00 | NaN |
2018-11-18 02:00:00 | NaN |
2018-11-18 03:00:00 | NaN |
2018-11-18 04:00:00 | NaN |
2018-11-18 05:00:00 | NaN |
2018-11-18 06:00:00 | NaN |
2018-11-18 07:00:00 | NaN |
2018-11-18 08:00:00 | NaN |
2018-11-18 09:00:00 | NaN |
2018-11-18 10:00:00 | NaN |
2018-11-18 11:00:00 | NaN |
2018-11-18 12:00:00 | NaN |
2018-11-18 13:00:00 | NaN |
2018-11-18 14:00:00 | NaN |
2018-11-18 15:00:00 | NaN |
2018-11-18 16:00:00 | NaN |
2018-11-18 17:00:00 | NaN |
2018-11-18 18:00:00 | NaN |
2018-11-18 19:00:00 | NaN |
2018-11-18 20:00:00 | NaN |
2018-11-18 21:00:00 | NaN |
2018-11-18 22:00:00 | NaN |
2018-11-18 23:00:00 | NaN |
2018-11-19 00:00:00 | 3.19 |
2018-11-19 01:00:00 | 2.60 |
2018-11-19 02:00:00 | 2.29 |
2018-11-19 03:00:00 | 1.97 |
2018-11-19 04:00:00 | 2.19 |
2018-11-19 05:00:00 | 3.09 |
2018-11-19 06:00:00 | 4.32 |
2018-11-19 07:00:00 | 4.87 |
2018-11-19 08:00:00 | 5.14 |
2018-11-19 09:00:00 | 5.55 |
2018-11-19 10:00:00 | 6.34 |
2018-11-19 11:00:00 | 7.43 |
2018-11-19 12:00:00 | 8.18 |
2018-11-19 13:00:00 | 8.53 |
2018-11-19 14:00:00 | 8.45 |
2018-11-19 15:00:00 | 7.94 |
2018-11-19 16:00:00 | 6.87 |
2018-11-19 17:00:00 | 5.56 |
2018-11-19 18:00:00 | 4.65 |
2018-11-19 19:00:00 | 4.18 |
2018-11-19 20:00:00 | 3.97 |
2018-11-19 21:00:00 | 3.98 |
2018-11-19 22:00:00 | 4.01 |
2018-11-19 23:00:00 | 4.00 |
So, for example, the desired output for 2018-11-18 00:00:00 would be the mean of 9.12 and 3.19, i.e. 6.155 (about 6.16), and so on for the other hours of 2018-11-18.
Is there a simple way to do this in pandas? Ideally with a method that could be applied to a whole column (feature) within a DataFrame, rather than having to slice out some of the data, transform it, and then put it back (honestly, it would be a lot quicker for me to do that in Excel!).
Thanks in advance for your help.
Try:
# Make sure every hour is present in the index (assumes "datetime" is already a datetime64 column)
df = df.set_index("datetime").resample("1h").last()
# Mean of the values 24 hours before and 24 hours after each row (NaN if either neighbour is missing)
means = df["value"].shift(24).add(df["value"].shift(-24)).mul(0.5)
# Fill the NaNs in df with those means; existing values are left untouched
df["value"] = df["value"].combine_first(means)
>>> df.iloc[24:48]
value
datetime
2018-11-18 00:00:00 6.155
2018-11-18 01:00:00 5.770
2018-11-18 02:00:00 5.485
2018-11-18 03:00:00 5.080
2018-11-18 04:00:00 4.970
2018-11-18 05:00:00 5.220
2018-11-18 06:00:00 5.685
2018-11-18 07:00:00 5.710
2018-11-18 08:00:00 5.720
2018-11-18 09:00:00 5.915
2018-11-18 10:00:00 6.510
2018-11-18 11:00:00 7.535
2018-11-18 12:00:00 8.395
2018-11-18 13:00:00 8.985
2018-11-18 14:00:00 9.145
2018-11-18 15:00:00 8.780
2018-11-18 16:00:00 7.520
2018-11-18 17:00:00 5.860
2018-11-18 18:00:00 5.290
2018-11-18 19:00:00 4.770
2018-11-18 20:00:00 4.330
2018-11-18 21:00:00 4.170
2018-11-18 22:00:00 4.345
2018-11-18 23:00:00 4.430
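If you want to reuse this on several columns, the same idea can be wrapped in a small helper. The following is only a sketch: the function name `fill_from_adjacent_days`, its `periods_per_day` parameter, and the synthetic three-day data set are made up for illustration and are not part of pandas or of the original data. It assumes the index is a regular hourly DatetimeIndex with no missing timestamps (which the `resample` step takes care of).

```python
import numpy as np
import pandas as pd


def fill_from_adjacent_days(series: pd.Series, periods_per_day: int = 24) -> pd.Series:
    """Fill NaNs with the mean of the values one day earlier and one day later.

    Hypothetical helper: assumes a regular index with `periods_per_day`
    rows per day and no missing timestamps.
    """
    previous_day = series.shift(periods_per_day)
    next_day = series.shift(-periods_per_day)
    # The sum is NaN whenever either neighbour is NaN, so those gaps stay unfilled.
    means = (previous_day + next_day) / 2
    # fillna only touches the NaN positions; existing values are kept.
    return series.fillna(means)


# Synthetic data: three days of hourly values with the middle day missing.
index = pd.date_range("2018-11-17", periods=72, freq="h")
values = np.arange(72, dtype=float)
values[24:48] = np.nan
df = pd.DataFrame({"datetime": index, "value": values})

# Same preparation step as above: guarantee one row per hour.
df = df.set_index("datetime").resample("1h").last()
df["value"] = fill_from_adjacent_days(df["value"])

print(df.loc["2018-11-18"].head())
```

Note that a gap is only filled when both the previous and the following day have a value at that hour. Gaps spanning two or more consecutive days, or gaps on the first or last day of the data, will remain NaN with this approach.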