Search code examples
pythonpandastime-seriesmissing-datalinear-interpolation

Can Pandas Linear interpolation capture seasonality patterns?


I have a time series dataset that contains 29,184 hours of data. There are around 1k+ rows of missing values.

Here is the link to the dataset: dropbox

Here is a small preview of the dataset:

                     NUMBER_OF_VEHICLES
DATE_TIME                              
2020-01-01 00:00:00                39.0
2020-01-01 01:00:00                 3.0
2020-01-01 02:00:00                 1.0
2020-01-01 03:00:00                 1.0
2020-01-01 04:00:00                 2.0

Here is a section of a plot that shows missing values: enter image description here

The column NUMBER_OF_VEHICLES contains both hourly and weekly seasonality. I am trying to meaningfully fill in the missing values in a way that captures the hourly and weekly seasonality too. Here are 2 approaches I did:

Approach 1

# Saving index of missing values
missing_ix = df[df['NUMBER_OF_VEHICLES'].isnull()].index

# Create HOUR and DAYOFWEEK features from datetime index
df= (
    df.assign(HOUR=lambda x: x.index.hour,
              DAYOFWEEK=lambda x: x.index.dayofweek)
)

# Linear interpolation
df['lin_impute'] = df['NUMBER_OF_VEHICLES'].interpolate("linear", limit_direction="both")

Here is the plot of the plot after linear interpolation. The indexes with missing values are shown in red color. enter image description here

The above approach does not capture the seasonality patterns. The missing data is filled in linearly.

Approach 2

# Saving index of missing values
missing_ix = df[df['NUMBER_OF_VEHICLES'].isnull()].index

# Create HOUR and DAYOFWEEK features from datetime index
df= (
    df.assign(HOUR=lambda x: x.index.hour,
              DAYOFWEEK=lambda x: x.index.dayofweek)
)

# Create a separate data frame with the mean number of vehicles per hour in a day.
hr_pattern = (
    df
    .groupby('HOUR', as_index=False)['NUMBER_OF_VEHICLES']
    .mean()
    .rename(columns={"NUMBER_OF_VEHICLES" : "hr_pattern"})
)

# Create a separate data frame with the mean number of vehicles per day of week.
week_pattern = (
    df
    .groupby('DAYOFWEEK', as_index=False)['NUMBER_OF_VEHICLES']
    .mean()
    .rename(columns={"NUMBER_OF_VEHICLES" : "week_pattern"})
)

# Merge with the main dataset
df_merged = (
    df.reset_index()
    .merge(hr_pattern, on='HOUR', how='inner')
    .merge(week_pattern, on='DAYOFWEEK', how='inner')
    .set_index('DATE_TIME')
)

# Linear interpolation
df_merged['lin_impute'] = df_merged['NUMBER_OF_VEHICLES'].interpolate("linear", limit_direction="both")

Now, here is the plot after approach 2: enter image description here

The above plot shows that the missing values have been filled in a way that shows the hourly and weekly patterns in the data.

My question is how did the 2nd approach lead to this result where the linearly interpolated time series captures both the hourly and weekly patterns in the data instead of filling in the missing values linearly?

Please help me to understand. Thank you so much for your help.


Solution

  • The short answer is no.

    What happens in the second case is that it creates two extra datasets with the mean values per hour and day of the week and joining them with the main dataset, so it fills some of the missing values with the mean of the whole dataset and then interpolating the rest of missing values.

    In other words, it seems it captures the seasonality because it populates some of the missing values by the mean per hour and day of the week, and then the missing values left are interpolated better with the linear interpolation because it works well when the adjacent data values to the missing value are not missing. That's why in the first case it's a straight line from the left adjacent value to the right one.