python pandas time-series nan missing-data

Pandas Dataframe fill missing values (NaN) from saisonal timerseries with values from day or week before

I have data from a heat device which is measuring the power consumption over one year.The Data have of course some NaN values (Around 600). I already resampled the data to 15 min time stamps. The Dataframe looks like follows:

                    timestamp  ...                            Power_consumption
    0 2021-06-01 00:15:00  ...                                       12.7
    1 2021-06-01 00:30:00  ...                                       12.5
    2 2021-06-01 00:45:00  ...                                       12.4
    3 2021-06-01 01:00:00  ...                                       12.3
    4 2021-06-01 01:15:00  ...                                       12.2
    5 2021-06-01 00:15:00  ...                                       12.7
    6 2021-06-01 00:30:00  ...                                       12.5
    7 2021-06-01 00:45:00  ...                                       12.4
    8 2021-06-01 01:00:00  ...                                       13.5
    9 2021-06-01 01:15:00  ...                                       12.2
.
.
.
   97 2021-06-02 00:45:00  ...                                       12.4
   98 2021-06-02 01:00:00  ...                                       12.3
   99 2021-06-02 01:15:00  ...                                       12.2
  100 2021-06-02 00:15:00  ...                                       12.7
  101 2021-06-02 00:30:00  ...                                       12.5
  102 2021-06-02 00:45:00  ...                                       NaN
  103 2021-06-02 01:00:00  ...                                       12.5
  104 2021-06-02 01:15:00  ...                                       12.2

I want to fill now e.g. the 2021-06-02 00:45:00 with the value from the day before at the same time (2021-06-01 00:45:00 ) and if this value is also NaN, I would like to take the value from exactly one week before, same time. Whats the best way to do so ?

Solution

Initialize the dataframe

df = pd.DataFrame([
["2021-06-01 00:15:00",12.7],
["2021-06-01 00:30:00",12.5],                                
["2021-06-01 00:45:00",2.4],                                
["2021-06-01 01:00:00",12.3],                                
["2021-06-01 01:15:00",2.2],                                
["2021-06-01 01:30:00",12.5],                                
["2021-06-01 01:45:00",12.4],                                
["2021-06-01 02:00:00",13.5],                                
["2021-06-01 02:15:00",12.2],

["2021-06-02 00:15:00",13.7],
["2021-06-02 00:30:00",13.5],                                
["2021-06-02 00:45:00",23.4],                                
["2021-06-02 01:00:00",13.3],                                
["2021-06-02 01:15:00",np.NaN],                                
["2021-06-02 01:30:00",np.NaN],                                
["2021-06-02 01:45:00",13.4],                                
["2021-06-02 02:00:00",13.5],                                
["2021-06-02 02:15:00",13.2],
], columns = ["timestamp", "Power_consumption"])

Replacing nans

# converting the column to pandas datetime
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Apply time delta of 1 and 7 for the rows having null values
df["oneday"] = df.loc[df.Power_consumption.isna()].apply(lambda x: x["timestamp"] - timedelta(1), axis=1)
df["oneweek"] = df.loc[df.Power_consumption.isna()].apply(lambda x: x["timestamp"] - timedelta(7), axis=1)

# Set timestamp as index
df = df.set_index("timestamp")

# For the rows where the power consumption is null, oneday column will be set. Query such rows
prev_day_timestamp = df[df.oneday.notna()]["oneday"]

# Query and get the power consumption values of previous date from the dataframe
one_day_values = df.loc[df.index.isin(prev_day_timestamp), "Power_consumption"].reset_index()["Power_consumption"]

# Apply the value to the nan rows
df.loc[df.Power_consumption.isnull(), "Power_consumption"] = one_day_values.tolist()


# Drop the extra columns
df.drop(["oneday"], inplace=True, axis=1)

# Reset the index
df = df.reset_index()

At the end if you find nan, you can apply the same logic again with oneweek column.

Sample output

    timestamp   Power_consumption
0   2021-06-01 00:15:00 12.7
1   2021-06-01 00:30:00 12.5
2   2021-06-01 00:45:00 2.4
3   2021-06-01 01:00:00 12.3
4   2021-06-01 01:15:00 2.2
5   2021-06-01 01:30:00 12.5
6   2021-06-01 01:45:00 12.4
7   2021-06-01 02:00:00 13.5
8   2021-06-01 02:15:00 12.2
9   2021-06-02 00:15:00 13.7
10  2021-06-02 00:30:00 13.5
11  2021-06-02 00:45:00 23.4
12  2021-06-02 01:00:00 13.3
13  2021-06-02 01:15:00 2.2
14  2021-06-02 01:30:00 12.5
15  2021-06-02 01:45:00 13.4
16  2021-06-02 02:00:00 13.5
17  2021-06-02 02:15:00 13.2

may not be a most effective solution for a large input.. But it works :-)