What would be the most efficient way to compute, in polars (python-polars), a rolling median absolute deviation (MAD) over a given period over a given "groupby key" ?
I have the following dataframe:
- a datetime column (type Datetime)
- a group column (type UInt32)
- a value column (type UInt32)
- ~13 million rows: values from 1500 distinct groups at the granularity of 5mn over 1 month
  - 5mn data points for a full month is around 8640 points
  - 8640 points for 1500 groups => 8640 * 1500 ~= 13 million data points

┌─────────────────────┬───────┬───────┐
│ datetime ┆ group ┆ value │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ u32 ┆ u32 │
╞═════════════════════╪═══════╪═══════╡
│ 2023-04-17 14:50:00 ┆ 1 ┆ 8 │
│ 2023-04-17 14:50:00 ┆ 2 ┆ 4 │
│ 2023-04-17 14:50:00 ┆ 3 ┆ 14893 │
│ 2023-04-17 14:50:00 ┆ 4 ┆ 15 │
│ … ┆ … ┆ … │
│ 2023-05-17 14:45:00 ┆ 1497 ┆ 3 │
│ 2023-05-17 14:45:00 ┆ 1498 ┆ 8193 │
│ 2023-05-17 14:45:00 ┆ 1499 ┆ 42 │
│ 2023-05-17 14:45:00 ┆ 1500 ┆ 15 │
└─────────────────────┴───────┴───────┘
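For context, here is a rough sketch of how a frame of that shape could be built for testing (the column names, types, 5mn granularity and group count come from the description above; the random values and the use of pl.datetime_range are my own assumptions and require a reasonably recent polars version):

from datetime import datetime

import numpy as np
import polars as pl

# One row per (5mn timestamp, group) pair over one month for 1500 groups,
# i.e. roughly 8640 * 1500 ~= 13 million rows.
timestamps = pl.datetime_range(
    datetime(2023, 4, 17, 14, 50),
    datetime(2023, 5, 17, 14, 45),
    interval="5m",
    eager=True,
).alias("datetime")
groups = pl.Series("group", range(1, 1501), dtype=pl.UInt32)

df = timestamps.to_frame().join(groups.to_frame(), how="cross")
df = df.with_columns(
    # Made-up values, just to have something to aggregate.
    value=pl.Series(np.random.randint(0, 20_000, size=df.height), dtype=pl.UInt32)
)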
I now want to compute a rolling median absolute deviation over a week of data (i.e. 2016 observations, as we are working at 5mn granularity), for every group.
I am currently able to compute a rolling standard deviation very efficiently with this kind of expression:
(
    df.with_columns(
        pl.col("value").rolling_std(2016).over("group").alias("std")
    )
)
It gives me an answer in less than a second!
Unfortunately, the standard deviation is too sensitive to "outliers" and we would need to switch to a median absolute deviation (i.e. take the median of the absolute distance to the median instead of the mean of the distance to the mean).
I gave rolling_map a try, computing the median from the returned series, but it is far too slow (it was still processing after more than 5 minutes, compared to the ~1s of rolling_std):
(
    df.with_columns(
        pl.col("value")
        .rolling_map(lambda s: (s - s.median()).median(), window_size=2016)
        .over("group")
        .alias("mad")
    )
)
Is there a way to do it more efficiently and reach performance comparable to rolling_std?
(I understand that computing the MAD can be slower than computing a standard deviation, but I mean the same order of magnitude, i.e. being able to compute it in several seconds.)
As noticed by @Dean Mac Gregor, the original lambda I put in the current approach above does not give the expected result (it always returns 0). The proper lambda is rather lambda s: (s - s.median()).abs().median() (the call to abs() was missing). The performance issue remains.
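For completeness, the corrected expression (still far too slow on this amount of data) is just the original rolling_map call with the missing abs() added:

(
    df.with_columns(
        pl.col("value")
        .rolling_map(lambda s: (s - s.median()).abs().median(), window_size=2016)
        .over("group")
        .alias("mad")
    )
)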
How can I compute it more efficiently?
I think you can use .cumulative_eval() for this. Although, from its docs: "Warning: This can be really slow as it can have O(n^2) complexity."
s = pl.element().tail(2016)

df.with_columns(
    mad=pl.col("value")
    .cumulative_eval((s - s.median()).abs().median(), parallel=True)
    .over("group")
)
You will get all the values, as if you supplied min_periods=1
to the rolling method.
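If you would rather mimic the null padding of the rolling method (i.e. min_periods equal to the window size), one possible follow-up, assuming a recent polars version where pl.int_range and pl.len are available, is to blank out the first 2015 rows of each group afterwards (reusing the s expression from above):

result = (
    df.with_columns(
        mad=pl.col("value")
        .cumulative_eval((s - s.median()).abs().median(), parallel=True)
        .over("group")
    )
    .with_columns(
        # Keep the MAD only once a full window of 2016 observations has been
        # seen within the group; earlier rows become null, like rolling_std.
        mad=pl.when(pl.int_range(0, pl.len()).over("group") >= 2016 - 1).then(
            pl.col("mad")
        )
    )
)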