Search code examples
pythonpandastime-seriesnanmissing-data

Pandas Dataframe fill missing values (NaN) from saisonal timerseries with values from day or week before


I have data from a heat device which is measuring the power consumption over one year.The Data have of course some NaN values (Around 600). I already resampled the data to 15 min time stamps. The Dataframe looks like follows:

                    timestamp  ...                            Power_consumption
    0 2021-06-01 00:15:00  ...                                       12.7
    1 2021-06-01 00:30:00  ...                                       12.5
    2 2021-06-01 00:45:00  ...                                       12.4
    3 2021-06-01 01:00:00  ...                                       12.3
    4 2021-06-01 01:15:00  ...                                       12.2
    5 2021-06-01 00:15:00  ...                                       12.7
    6 2021-06-01 00:30:00  ...                                       12.5
    7 2021-06-01 00:45:00  ...                                       12.4
    8 2021-06-01 01:00:00  ...                                       13.5
    9 2021-06-01 01:15:00  ...                                       12.2
.
.
.
   97 2021-06-02 00:45:00  ...                                       12.4
   98 2021-06-02 01:00:00  ...                                       12.3
   99 2021-06-02 01:15:00  ...                                       12.2
  100 2021-06-02 00:15:00  ...                                       12.7
  101 2021-06-02 00:30:00  ...                                       12.5
  102 2021-06-02 00:45:00  ...                                       NaN
  103 2021-06-02 01:00:00  ...                                       12.5
  104 2021-06-02 01:15:00  ...                                       12.2

I want to fill now e.g. the 2021-06-02 00:45:00 with the value from the day before at the same time (2021-06-01 00:45:00 ) and if this value is also NaN, I would like to take the value from exactly one week before, same time. Whats the best way to do so ?


Solution

  • Initialize the dataframe

    df = pd.DataFrame([
    ["2021-06-01 00:15:00",12.7],
    ["2021-06-01 00:30:00",12.5],                                
    ["2021-06-01 00:45:00",2.4],                                
    ["2021-06-01 01:00:00",12.3],                                
    ["2021-06-01 01:15:00",2.2],                                
    ["2021-06-01 01:30:00",12.5],                                
    ["2021-06-01 01:45:00",12.4],                                
    ["2021-06-01 02:00:00",13.5],                                
    ["2021-06-01 02:15:00",12.2],
    
    ["2021-06-02 00:15:00",13.7],
    ["2021-06-02 00:30:00",13.5],                                
    ["2021-06-02 00:45:00",23.4],                                
    ["2021-06-02 01:00:00",13.3],                                
    ["2021-06-02 01:15:00",np.NaN],                                
    ["2021-06-02 01:30:00",np.NaN],                                
    ["2021-06-02 01:45:00",13.4],                                
    ["2021-06-02 02:00:00",13.5],                                
    ["2021-06-02 02:15:00",13.2],
    ], columns = ["timestamp", "Power_consumption"])
    

    Replacing nans

    # converting the column to pandas datetime
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    
    # Apply time delta of 1 and 7 for the rows having null values
    df["oneday"] = df.loc[df.Power_consumption.isna()].apply(lambda x: x["timestamp"] - timedelta(1), axis=1)
    df["oneweek"] = df.loc[df.Power_consumption.isna()].apply(lambda x: x["timestamp"] - timedelta(7), axis=1)
    
    # Set timestamp as index
    df = df.set_index("timestamp")
    
    # For the rows where the power consumption is null, oneday column will be set. Query such rows
    prev_day_timestamp = df[df.oneday.notna()]["oneday"]
    
    # Query and get the power consumption values of previous date from the dataframe
    one_day_values = df.loc[df.index.isin(prev_day_timestamp), "Power_consumption"].reset_index()["Power_consumption"]
    
    # Apply the value to the nan rows
    df.loc[df.Power_consumption.isnull(), "Power_consumption"] = one_day_values.tolist()
    
    
    # Drop the extra columns
    df.drop(["oneday"], inplace=True, axis=1)
    
    # Reset the index
    df = df.reset_index()
    
    

    At the end if you find nan, you can apply the same logic again with oneweek column.

    Sample output

        timestamp   Power_consumption
    0   2021-06-01 00:15:00 12.7
    1   2021-06-01 00:30:00 12.5
    2   2021-06-01 00:45:00 2.4
    3   2021-06-01 01:00:00 12.3
    4   2021-06-01 01:15:00 2.2
    5   2021-06-01 01:30:00 12.5
    6   2021-06-01 01:45:00 12.4
    7   2021-06-01 02:00:00 13.5
    8   2021-06-01 02:15:00 12.2
    9   2021-06-02 00:15:00 13.7
    10  2021-06-02 00:30:00 13.5
    11  2021-06-02 00:45:00 23.4
    12  2021-06-02 01:00:00 13.3
    13  2021-06-02 01:15:00 2.2
    14  2021-06-02 01:30:00 12.5
    15  2021-06-02 01:45:00 13.4
    16  2021-06-02 02:00:00 13.5
    17  2021-06-02 02:15:00 13.2
    

    may not be a most effective solution for a large input.. But it works :-)