python time-series interpolation fillna merging-data

Time Series fill NAs for big time gaps (more than a month) based on related data from other years


I have a time series dataset with daily observations for the period 2015-2022. Years 2015-2018 only have data for the first half of the year, from January to June (all days from July to December are NA), while years 2019-2022 have data for every day of every month. I would like to fill the NAs in the incomplete years using the data from 2019-2022. In other words, for each day of the missing months I want to calculate the mean of the values observed on that day in the other years, and then fill each NA with the corresponding mean.

Here is the code that creates a DataFrame representing my dataset and the problem:

import numpy as np
import pandas as pd


np.random.seed(1)

rows, cols = 1000, 3  # 1000 rows, one row per day (from 1 January 2019 to 26 September 2021)
data = np.random.rand(rows, cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='D')
df = pd.DataFrame(data, columns=['a', 'b', 'c'], index=tidx)  # 3 data columns
date_list = pd.date_range(start='2019-07-01', end='2019-08-31')
df[df.index.isin(date_list)] = np.nan  # July and August 2019 are missing for all days; 2020 and 2021 have data for these months

print(df)

In the simulated df above, I need to fill the NAs for July and August 2019 with the mean values calculated for each day of these months from the corresponding data in 2020 and 2021. I have tried the various interpolation methods provided by pandas, but the results are very poor because the periods with missing data are so long.


Solution

  • This can be done in three steps:

    • Step 1: find the days of the year for which values are missing.
    • Step 2: for those days, compute the average of the values from the years that do have data (2020 and 2021 here). This is easily done with groupby, using mean as the aggregation function.
    • Step 3: fill the NaN values with the computed means.
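    # step 1: label each row with its day of the year and collect the days that have missing values
    df["date_year"] = df.index.dayofyear
    missing_days = df.loc[df.a.isna(), "date_year"].to_list()

    # step 2: mean per day of year over the rows that have data (mean() skips NaN);
    # selecting the data columns keeps helper columns out of the result
    res = df[df.date_year.isin(missing_days)].groupby("date_year")[["a", "b", "c"]].mean()

    # step 3: look up the mean for each missing row by its day of year and fill it in
    mask = df.a.isna()
    df.loc[mask, ["a", "b", "c"]] = res.loc[df.loc[mask, "date_year"]].to_numpy()
    df = df.drop(columns="date_year")  # drop the helper column; drop() returns a new frame, so reassign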
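One caveat with the day-of-year key: after 29 February, day numbers in a leap year are shifted by one, so day 182 is 1 July in 2019 and 2021 but 30 June in 2020. If averages should be taken over the same calendar date, a possible variant is to group by (month, day) instead. The sketch below is not part of the original answer; the names `keys`, `day_means` and `filled` are only illustrative, and it assumes every missing date has at least one other year with data.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question.
np.random.seed(1)
tidx = pd.date_range("2019-01-01", periods=1000, freq="D")
df = pd.DataFrame(np.random.rand(1000, 3), columns=["a", "b", "c"], index=tidx)
df.loc["2019-07-01":"2019-08-31"] = np.nan  # July and August 2019 are missing

# Group rows by calendar (month, day) so that 1 July is always matched with
# 1 July, whether or not the year contains 29 February.
keys = [df.index.month, df.index.day]

# transform("mean") returns a frame shaped like df in which every row holds
# the mean of its (month, day) group; NaN values are skipped in the mean.
day_means = df.groupby(keys).transform("mean")

# Replace only the missing cells; observed values are left untouched.
filled = df.fillna(day_means)

print(filled.loc["2019-06-29":"2019-07-03"])
```

Because `fillna` aligns on index and columns, no helper column has to be added to `df` or dropped afterwards, and the original frame is left unchanged.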