I have a time series dataset with daily observations for the period 2015-2022. However, the years 2015-2018 have data for only half of the year, from January to June (all days from July to December are NA), while the years 2019-2022 have data for all months and all days. I would like to fill the NAs in the incomplete years based on the data from 2019-2022. In other words, I would like to calculate the mean value for each day of the missing months from the existing values in the other years, and then fill the NAs with the corresponding means.
Here is the code for creating a DataFrame that represents my dataset and the problem:
import numpy as np
import pandas as pd
np.random.seed(1)
rows, cols = 1000, 3  # DataFrame with 1000 rows, each row represents one day (from 1 January 2019 to 26 September 2021)
data = np.random.rand(rows,cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='D')
df = pd.DataFrame(data, columns=['a', 'b', 'c'], index=tidx)  # three columns with data
date_list = pd.date_range(start='2019-07-01', end='2019-08-31')
df[df.index.isin(date_list)] = np.nan  # July and August 2019 are missing for all days, but 2020 and 2021 have data for these months
print(df)
In the simulated df presented above, I need to fill the NAs for July and August of 2019 with mean values calculated for each day of these months from the corresponding data present for 2020 and 2021. I have tried various interpolation methods provided by the pandas library, but the results are very poor because the periods with missing data are quite long.
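This can be done in several steps: label each row with its day within the year, compute the per-day means, and assign them to the missing rows.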
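# step 1: label each row with its calendar day ("MM-DD"); dayofyear would be
# shifted by one after 29 February in leap years such as 2020
df["month_day"] = df.index.strftime("%m-%d")
missing_days = df.loc[df.a.isna(), "month_day"].unique()
# step 2: per-day means over the years that have data (mean() skips NaN)
res = df[df.month_day.isin(missing_days)].groupby("month_day")[["a", "b", "c"]].mean()
# step 3: look up the mean for each missing row and assign it
mask = df.a.isna()
df.loc[mask, ["a", "b", "c"]] = res.loc[df.loc[mask, "month_day"]].values
df = df.drop(columns="month_day")  # drop the helper column; drop() returns a new frame, so reassign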
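The same idea can also be written without helper columns. The following is a minimal sketch, checked only against the simulated frame from the question and not against the real 2015-2022 data. It groups by (month, day) and uses transform("mean") so that the per-day means come back aligned with the original index, then passes them to fillna. The names key, daily_mean and filled are my own choices for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the simulated frame from the question
np.random.seed(1)
tidx = pd.date_range("2019-01-01", periods=1000, freq="D")
df = pd.DataFrame(np.random.rand(1000, 3), columns=["a", "b", "c"], index=tidx)
df.loc["2019-07-01":"2019-08-31"] = np.nan  # July and August 2019 are missing

# Group by calendar day (month, day) rather than dayofyear, so that
# leap years do not shift the dates being averaged
key = [df.index.month, df.index.day]

# transform("mean") returns a frame shaped like df in which every row holds
# the mean of its calendar day across all years (NaN values are skipped)
daily_mean = df.groupby(key).transform("mean")

# Replace only the missing cells; existing observations are left untouched
filled = df.fillna(daily_mean)

print(filled.loc["2019-07-01":"2019-07-03"])
```

One caveat: a calendar day that has no data in any year (for example 29 February, if the only leap year in the data has it missing) would still be NaN after this step and would need separate handling.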