Tags: python, pandas, dataframe, time-series, interpolation

How do I interpolate hourly datetime data as the mean of the prior and following day in pandas?


I have a pandas data frame containing a large-ish set of hourly data points. For a few days, there are missing data (NaN). I want to interpolate values for the missing hourly data points by calculating the mean of the same time period on the prior and following day (I've done some analysis and believe this will be reasonable).

An example of the data is below:

datetime value
2018-11-17 00:00:00 9.12
2018-11-17 01:00:00 8.94
2018-11-17 02:00:00 8.68
2018-11-17 03:00:00 8.19
2018-11-17 04:00:00 7.75
2018-11-17 05:00:00 7.35
2018-11-17 06:00:00 7.05
2018-11-17 07:00:00 6.55
2018-11-17 08:00:00 6.30
2018-11-17 09:00:00 6.28
2018-11-17 10:00:00 6.68
2018-11-17 11:00:00 7.64
2018-11-17 12:00:00 8.61
2018-11-17 13:00:00 9.44
2018-11-17 14:00:00 9.84
2018-11-17 15:00:00 9.62
2018-11-17 16:00:00 8.17
2018-11-17 17:00:00 6.16
2018-11-17 18:00:00 5.93
2018-11-17 19:00:00 5.36
2018-11-17 20:00:00 4.69
2018-11-17 21:00:00 4.36
2018-11-17 22:00:00 4.68
2018-11-17 23:00:00 4.86
2018-11-18 00:00:00 NaN
2018-11-18 01:00:00 NaN
2018-11-18 02:00:00 NaN
2018-11-18 03:00:00 NaN
2018-11-18 04:00:00 NaN
2018-11-18 05:00:00 NaN
2018-11-18 06:00:00 NaN
2018-11-18 07:00:00 NaN
2018-11-18 08:00:00 NaN
2018-11-18 09:00:00 NaN
2018-11-18 10:00:00 NaN
2018-11-18 11:00:00 NaN
2018-11-18 12:00:00 NaN
2018-11-18 13:00:00 NaN
2018-11-18 14:00:00 NaN
2018-11-18 15:00:00 NaN
2018-11-18 16:00:00 NaN
2018-11-18 17:00:00 NaN
2018-11-18 18:00:00 NaN
2018-11-18 19:00:00 NaN
2018-11-18 20:00:00 NaN
2018-11-18 21:00:00 NaN
2018-11-18 22:00:00 NaN
2018-11-18 23:00:00 NaN
2018-11-19 00:00:00 3.19
2018-11-19 01:00:00 2.60
2018-11-19 02:00:00 2.29
2018-11-19 03:00:00 1.97
2018-11-19 04:00:00 2.19
2018-11-19 05:00:00 3.09
2018-11-19 06:00:00 4.32
2018-11-19 07:00:00 4.87
2018-11-19 08:00:00 5.14
2018-11-19 09:00:00 5.55
2018-11-19 10:00:00 6.34
2018-11-19 11:00:00 7.43
2018-11-19 12:00:00 8.18
2018-11-19 13:00:00 8.53
2018-11-19 14:00:00 8.45
2018-11-19 15:00:00 7.94
2018-11-19 16:00:00 6.87
2018-11-19 17:00:00 5.56
2018-11-19 18:00:00 4.65
2018-11-19 19:00:00 4.18
2018-11-19 20:00:00 3.97
2018-11-19 21:00:00 3.98
2018-11-19 22:00:00 4.01
2018-11-19 23:00:00 4.00

So, for example, the desired output for 2018-11-18 00:00:00 would be the mean of 9.12 and 3.19, i.e. 6.155 (about 6.16), and so on for the other hours of the day on 2018-11-18.

Is there a simple way to do this in pandas? Ideally with a method that can be applied to a whole column (feature) within a data frame, rather than having to slice out some of the data, transform it, and then put it back (because honestly, it would be a lot quicker for me to do that in Excel!).

Thanks in advance for your help.


Solution

  • Try:

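    import pandas as pd

    # make sure "datetime" is a real datetime column, then index by it and
    # resample so that every hour has a row (hours with no data become NaN)
    df["datetime"] = pd.to_datetime(df["datetime"])
    df = df.set_index("datetime").resample("1h").last()
    
    # series of means: average of the values 24 rows (hours) before and after
    means = df["value"].shift(24).add(df["value"].shift(-24)).mul(0.5)
    
    # fill only the NaN entries in df with the corresponding means
    df["value"] = df["value"].combine_first(means)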
    
    >>> df.iloc[24:48]
                         value
    datetime                  
    2018-11-18 00:00:00  6.155
    2018-11-18 01:00:00  5.770
    2018-11-18 02:00:00  5.485
    2018-11-18 03:00:00  5.080
    2018-11-18 04:00:00  4.970
    2018-11-18 05:00:00  5.220
    2018-11-18 06:00:00  5.685
    2018-11-18 07:00:00  5.710
    2018-11-18 08:00:00  5.720
    2018-11-18 09:00:00  5.915
    2018-11-18 10:00:00  6.510
    2018-11-18 11:00:00  7.535
    2018-11-18 12:00:00  8.395
    2018-11-18 13:00:00  8.985
    2018-11-18 14:00:00  9.145
    2018-11-18 15:00:00  8.780
    2018-11-18 16:00:00  7.520
    2018-11-18 17:00:00  5.860
    2018-11-18 18:00:00  5.290
    2018-11-18 19:00:00  4.770
    2018-11-18 20:00:00  4.330
    2018-11-18 21:00:00  4.170
    2018-11-18 22:00:00  4.345
    2018-11-18 23:00:00  4.430
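
Since the question asks for something that can be applied to a whole column, the same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name `fill_with_adjacent_day_mean` and the sample data are made up for illustration, and the series is assumed to have a regular hourly index with no missing rows (for example after the `resample` step above), so that 24 rows correspond to exactly one day.

```python
import numpy as np
import pandas as pd


def fill_with_adjacent_day_mean(series: pd.Series, periods: int = 24) -> pd.Series:
    """Fill NaN with the mean of the values `periods` rows before and after.

    Hypothetical helper: assumes a regular index with no missing rows,
    so that `periods` rows span exactly one day.
    """
    prior = series.shift(periods)        # same hour on the previous day
    following = series.shift(-periods)   # same hour on the following day
    means = (prior + following) / 2      # NaN if either neighbour is missing
    return series.fillna(means)          # existing values are left untouched


# Three days of made-up hourly data, with the middle day missing.
index = pd.date_range("2018-11-17", periods=72, freq="h")
values = np.arange(72, dtype=float)
values[24:48] = np.nan
df = pd.DataFrame({"value": values}, index=index)

df["value"] = fill_with_adjacent_day_mean(df["value"])
```

Note that a missing hour is only filled when both the prior-day and following-day values exist. If two consecutive days are missing, or the gap is on the first or last day of the data, the result stays NaN and needs some other treatment.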