I have sets of Google Analytics data from a website which I plan to analyse for a project. However, due to maintenance and other factors, there are chunks of dates for which there is no data. I want to impute this data while still maintaining the integrity of the data as I plan to plot these sets and compare the curves of different sets to each-other over time.
I want to use the nearest valid datapoints to each missing datapoint to impute that value in order to maintain the underlying shape that can be seen from the image.
I've already tried to use scikit-learn's KNN-Imputer and Iterative Imputer but I'm either miss-understanding how these imputers are supposed to be used or they're not the correct for what I'm trying to do, potentially both.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
df = pd.read_csv('data.csv', names=['Day','Views'],delimiter=',',skiprows=3, usecols=[0,1], skipfooter=1, engine='python', quoting= 1)
df = df.replace(0, np.nan)
da = df.Views.rename_axis('ID').values
da = da.reshape(-1,1)
imputer = IterativeImputer(n_nearest_features = 100, max_iter = 10)
df_imputed = imputer.fit_transform(da)
df_imputed.reshape(1,-1)
df.Views = df_imputed
df
All of the NaN values are calculated to be the exact same number from what I have currently implemented.
Any help would be greatly appreciated.
The problem here was I reshaping the array. My data was just a 1D array of values so I was making it 2D by reshaping the array which was causing all the NaN values to be calculated as the same. When I added an index column and included this as an input to the imputer the values were calculated correctly.I also ended up using a KNN imputer from sklearn instead of the iterative imputer in this instance.