python-polars

Best way to compute a rolling median absolute deviation (MAD) in polars (python)


What would be the most efficient way to compute, in polars (python-polars), a rolling median absolute deviation (MAD) over a given period, per "groupby key"?

The context/the dataframe

I have the following dataframe (a sketch to generate a synthetic equivalent is given after the table):

  • it contains 3 columns
    • 1 datetime column (type Datetime)
    • 1 group column (type UInt32)
    • 1 value column (type UInt32)
  • the dataframe contains ~13 million rows
    • it contains values from 1500 distinct groups at a 5-minute granularity over 1 month
    • 5-minute data points for a full month amount to around 8640 points per group
    • 8640 points for 1500 groups => 8640 * 1500 ~= 13 million data points
┌─────────────────────┬───────┬───────┐
│ datetime            ┆ group ┆ value │
│ ---                 ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ u32   ┆ u32   │
╞═════════════════════╪═══════╪═══════╡
│ 2023-04-17 14:50:00 ┆ 1     ┆ 8     │
│ 2023-04-17 14:50:00 ┆ 2     ┆ 4     │
│ 2023-04-17 14:50:00 ┆ 3     ┆ 14893 │
│ 2023-04-17 14:50:00 ┆ 4     ┆ 15    │
│ …                   ┆ …     ┆ …     │
│ 2023-05-17 14:45:00 ┆ 1497  ┆ 3     │
│ 2023-05-17 14:45:00 ┆ 1498  ┆ 8193  │
│ 2023-05-17 14:45:00 ┆ 1499  ┆ 42    │
│ 2023-05-17 14:45:00 ┆ 1500  ┆ 15    │
└─────────────────────┴───────┴───────┘
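
For experimentation, a dataframe with this shape can be generated synthetically. This is a minimal sketch, assuming a recent Polars version: the date range and group count come from the description above, while the random values (and the use of numpy to produce them) are illustrative assumptions.

from datetime import datetime

import numpy as np
import polars as pl

# ~8640 five-minute timestamps over one month, crossed with 1500 groups
dts = pl.datetime_range(
    datetime(2023, 4, 17, 14, 50),
    datetime(2023, 5, 17, 14, 45),
    interval="5m",
    eager=True,
).alias("datetime")

groups = pl.DataFrame({"group": range(1, 1501)}, schema={"group": pl.UInt32})

df = (
    pl.DataFrame(dts)
    .join(groups, how="cross")
    .with_columns(
        pl.Series(
            "value",
            np.random.randint(0, 20_000, size=len(dts) * groups.height),
            dtype=pl.UInt32,
        )
    )
    .sort("datetime", "group")
)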

The processing

I now want to compute a rolling median absolute deviation over a week of data (i.e. 2016 observations, as we are working at 5-minute granularity) for every group.

I am currently able to compute a rolling standard deviation very efficiently with this kind of expression:

(
    df
    .with_columns(
        pl.col("value").rolling_std(2016).over("group").alias("std")
    )
)

It gives me an answer in less than a second!

Unfortunately, the standard deviation is too sensitive to outliers, so we need to switch to a median absolute deviation (i.e. take the median of the absolute deviations from the median, rather than basing the measure on deviations from the mean).
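
As a small illustration with made-up numbers, a single outlier inflates the standard deviation but barely moves the MAD:

import polars as pl

s = pl.Series("value", [8, 9, 10, 11, 12, 1000])  # one outlier

s.std()                            # ~404, dominated by the outlier
(s - s.median()).abs().median()    # 1.5, barely affected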

The current approach

I gave rolling_map a try, computing the median from the returned series, but it is far too slow (it was still processing after more than 5 minutes, compared to the ~1 s of rolling_std).

(
    df
    .with_columns(
        pl.col("value")
        .rolling_map(lambda s: (s - s.median()).median(), window_size=2016)
        .over("group")
        .alias("mad")
    )
)

Improving the performance

Is there a way to do this more efficiently and reach the same kind of performance as rolling_std?

(I understand that computing the MAD can be slower than computing a standard deviation, but I mean staying within the same order of magnitude, i.e. being able to compute it in a few seconds.)

UPDATE

As noticed by @Dean Mac Gregor, the original lambda I used in The current approach does not give the expected result (it always returns 0).
The correct lambda is lambda s: (s - s.median()).abs().median() (the call to .abs() was missing).
The performance issue remains.
How can it be computed more efficiently?
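
For reference, the corrected (but still slow) version of the expression is:

(
    df
    .with_columns(
        pl.col("value")
        .rolling_map(
            lambda s: (s - s.median()).abs().median(),  # .abs() added
            window_size=2016,
        )
        .over("group")
        .alias("mad")
    )
)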


Solution

  • I think you can use .cumulative_eval() for this.

    Although: "Warning: This can be really slow as it can have O(n^2) complexity."

    s = pl.element().tail(2016)
    
    df.with_columns(mad = 
       pl.col("value")
         .cumulative_eval((s - s.median()).abs().median(), parallel=True)
         .over("group")
    )
    

    You will get all the values, as if you supplied min_periods=1 to the rolling method.
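
    If you want to mimic the default rolling behaviour instead (null until a full window of 2016 values is available), one option is to null out the first window_size - 1 rows of each group afterwards. A sketch, assuming a recent Polars version where pl.int_range and pl.len are available:

    window_size = 2016
    s = pl.element().tail(window_size)

    (
        df.with_columns(mad =
            pl.col("value")
              .cumulative_eval((s - s.median()).abs().median(), parallel=True)
              .over("group")
        )
        .with_columns(
            # null out rows where the window is not yet full
            mad = pl.when(pl.int_range(pl.len()).over("group") >= window_size - 1)
                    .then(pl.col("mad"))
        )
    )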