Search code examples
pythonpandasdataframemissing-dataimputation

How to impute missing values in electricity time series data considering the previous and next 2 values on the same day and time previous 2 days?


I have a pandas dataframe like below:

meter IDs date 00:00 00:30 01:00 01:30 ....... 23:00 23:30
1 2020-09-01 0.30 0.40 0.41 0.42 ....... 0.47 0.39
1 2020-09-02 0.36 0.39 nan nan ....... 0.53 0.41
1 2020-09-03 0.26 0.33 0.38 nan ....... nan 0.45
1 2020-09-04 0.36 0.40 0.41 0.42 ....... 0.69 nan
1 2020-09-05 0.30 0.40 0.41 0.42 ....... 0.69 0.62
: : : : : : ....... : :
1 2020-12-31 0.30 0.40 0.41 0.42 ....... 0.69 0.62
2 2020-09-01 0.55 0.57 nan 0.60 ....... 0.68 0.62
2 2020-09-02 0.61 0.59 0.66 nan ....... nan nan
2 2020-09-03 0.64 nan 0.66 0 ....... 0.66 0.67
2 2020-09-04 0.58 0.65 0.65 0.71 ....... 0.69 0.66
2 2020-09-05 0.49 0.52 0.55 nan ....... nan 0.62
: : : : : : ....... : :
2 2020-12-31 0.48 0.51 nan 0.52 ....... 0.63 0.61

I want to fill the missing values for the individual meter IDs considering mean of the previous and next 2 values on the same day and same time of previous 2 days?

How can i fix this problem? Thanks in advance.


Solution

  • pandas' fillna() function is a built-in tool for these scenarios. You can use this reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html. I think you might be able to integrate the mean and previous values. But, for a simple example, if you want to replace all NaN elements in columns ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively, and have a df below:

         A    B   C    D
    0  NaN  2.0 NaN  0.0
    1  3.0  4.0 NaN  1.0
    2  NaN  NaN NaN  NaN
    3  NaN  3.0 NaN  4.0
    
    values = {"A": 0, "B": 1, "C": 2, "D": 3}
    df.fillna(value=values)
    

    Then you get:

        A    B    C    D
    0  0.0  2.0  2.0  0.0
    1  3.0  4.0  2.0  1.0
    2  0.0  1.0  2.0  3.0
    3  0.0  3.0  2.0  4.0
    

    Or, a custom filling function that iterates over the data frame might help:

    import pandas as pd
    
    df.sort_values(by=['meter IDs', 'date'], inplace=True)
    
    def fill_missing(df):
        for meter_id, group in df.groupby('meter IDs'):
            for col in df.columns[2:]:
                for i in range(len(df)):
                    if pd.isnull(df.at[i, col]):
                        prev_vals = group[(group['date'] < df.at[i, 'date']) & (group[col].notnull())].tail(2)[col]
                        next_vals = group[(group['date'] > df.at[i, 'date']) & (group[col].notnull())].head(2)[col]
                        mean_val = pd.concat([prev_vals, next_vals]).mean()
                        df.at[i, col] = mean_val
    
    fill_missing(df)
    print(df)