Tags: python, pandas, time-series, cumsum

Cleaning an erroneous cumulative sum with Pandas


I am trying to efficiently clean up a Pandas time series that contains a daily cumulative rainfall sum. Sometimes, multiple consecutive values are lower than the previous value, and I want to replace those values with the last correct value.

Note that the time series spans multiple days, so the cumulative sum resets at the first timestamp of each day.

Example data, with erroneous values on the first day at timestamps 2024-08-12 14:14:48+00:00 and 2024-08-12 14:19:55+00:00:

                           RR_TODAY                       day  day_change
timestamp                                                                
2024-08-12 14:04:57+00:00      0.00 2024-08-13 00:00:00+10:00        True
2024-08-12 14:09:58+00:00      1.50 2024-08-13 00:00:00+10:00       False
2024-08-12 14:14:48+00:00      1.40 2024-08-13 00:00:00+10:00       False
2024-08-12 14:19:55+00:00      1.40 2024-08-13 00:00:00+10:00       False
2024-08-12 14:24:58+00:00      1.50 2024-08-13 00:00:00+10:00       False
2024-08-12 14:24:58+00:00      1.60 2024-08-13 00:00:00+10:00       False
2024-08-14 13:39:59+00:00      0.91 2024-08-14 00:00:00+10:00       True
2024-08-14 13:44:52+00:00      1.01 2024-08-14 00:00:00+10:00       False
2024-08-14 13:49:56+00:00      1.40 2024-08-14 00:00:00+10:00       False
2024-08-14 13:54:51+00:00      9.91 2024-08-14 00:00:00+10:00       False
2024-08-14 13:59:53+00:00      9.91 2024-08-14 00:00:00+10:00       False
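
For reference, a minimal sketch that reconstructs this sample frame (the value of station.timezone is not shown in the question; Australia/Brisbane, a fixed UTC+10 zone, is assumed here so that the day column matches):

import pandas as pd

# Hypothetical reconstruction of the sample data above; the timezone is an
# assumption, since the value of station.timezone is not given in the question.
idx = pd.to_datetime([
    "2024-08-12 14:04:57+00:00", "2024-08-12 14:09:58+00:00",
    "2024-08-12 14:14:48+00:00", "2024-08-12 14:19:55+00:00",
    "2024-08-12 14:24:58+00:00", "2024-08-12 14:24:58+00:00",
    "2024-08-14 13:39:59+00:00", "2024-08-14 13:44:52+00:00",
    "2024-08-14 13:49:56+00:00", "2024-08-14 13:54:51+00:00",
    "2024-08-14 13:59:53+00:00",
], utc=True).rename("timestamp")
df = pd.DataFrame({"RR_TODAY": [0.00, 1.50, 1.40, 1.40, 1.50, 1.60,
                                0.91, 1.01, 1.40, 9.91, 9.91]}, index=idx)
df["day"] = df.index.tz_convert("Australia/Brisbane").floor("D")
df["day_change"] = df["day"] != df["day"].shift(periods=1)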

What I would like to end up with:

                           RR_TODAY                       day  day_change
timestamp                                                                
2024-08-12 14:04:57+00:00      0.00 2024-08-13 00:00:00+10:00        True
2024-08-12 14:09:58+00:00      1.50 2024-08-13 00:00:00+10:00       False
2024-08-12 14:14:48+00:00      1.50 2024-08-13 00:00:00+10:00       False
2024-08-12 14:19:55+00:00      1.50 2024-08-13 00:00:00+10:00       False
2024-08-12 14:24:58+00:00      1.50 2024-08-13 00:00:00+10:00       False
2024-08-12 14:24:58+00:00      1.60 2024-08-13 00:00:00+10:00       False
2024-08-14 13:39:59+00:00      0.91 2024-08-14 00:00:00+10:00       True
2024-08-14 13:44:52+00:00      1.01 2024-08-14 00:00:00+10:00       False
2024-08-14 13:49:56+00:00      1.40 2024-08-14 00:00:00+10:00       False
2024-08-14 13:54:51+00:00      9.91 2024-08-14 00:00:00+10:00       False
2024-08-14 13:59:53+00:00      9.91 2024-08-14 00:00:00+10:00       False

I came up with the following solution, which works but is highly inefficient:

import numpy as np
import pandas as pd


def cleanup_RR_TODAY(df: pd.DataFrame) -> pd.Series:
    """
    Clean up the RR_TODAY measure by raising each value to the running daily
    maximum whenever it drops below a previously seen value.
    :param df: The DataFrame to clean up
    :return: The cleaned-up RR_TODAY series
    """
    max_value = np.nan
    clean_series = pd.Series(dtype="float64")
    for index, value in df.iterrows():
        if (
            np.isnan(max_value)
            or value["day_change"]
            or (not np.isnan(value["RR_TODAY"]) and value["RR_TODAY"] > max_value)
        ):
            # New day or new running maximum: take the value as-is.
            max_value = value["RR_TODAY"]
            clean_series[index] = max_value
        elif value["RR_TODAY"] < max_value:
            # Erroneous drop: replace with the last correct (maximum) value.
            clean_series[index] = max_value
        else:
            clean_series[index] = value["RR_TODAY"]
    return clean_series


df["day"] = df.index.tz_convert(station.timezone).floor("D")
df["day_change"] = df["day"] != df["day"].shift(periods=1)
df["RR_TODAY"] = cleanup_RR_TODAY(df)

I am looking for a vectorised solution that would run much faster.


Solution

  • It seems like the following should work. It is vectorized and I suspect it should run faster at scale.

    df["day"] = df.index.tz_convert(station.timezone).floor("D")
    df["RR_MAX"] = (df.groupby("day")["RR_TODAY"]
                      .transform("cummax"))
    df["RR_TODAY"] = df[["RR_TODAY", "RR_MAX"]].max(axis = 1)
    

    The RR_MAX column keeps track of the maximum RR_TODAY value seen so far within the given day, with NaNs zeroed out so that they do not change the corresponding RR_TODAY entry.

    EDIT: Apparently, the pandas max function ignores NaNs (unlike the NumPy max function), so there is no need to zero them out.
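
    A minimal illustration of that behaviour, using a hypothetical toy frame rather than the data above: DataFrame.max defaults to skipna=True, so a row containing a NaN still yields the non-missing maximum, whereas NumPy propagates the NaN.

    import numpy as np
    import pandas as pd

    # Toy frame (not from the question) showing the skipna behaviour.
    tmp = pd.DataFrame({"RR_TODAY": [1.4, float("nan")], "RR_MAX": [1.5, 1.5]})

    print(tmp.max(axis=1))                             # 1.5 and 1.5 -- the NaN is ignored
    print(np.maximum(tmp["RR_TODAY"], tmp["RR_MAX"]))  # 1.5 and NaN -- the NaN propagates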