
Efficient Algorithm for Rolling Median Absolute Deviation (MAD)?


Is there an efficient algorithm for the rolling Median Absolute Deviation (MAD)? A moving window slides over the array X, computing the MAD at every position.

It's possible to compute a rolling median efficiently (with a sorted-list data structure, etc.).
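For example, a minimal sketch using the third-party sortedcontainers library (any order-statistics structure would do):

from sortedcontainers import SortedList

def rolling_median(data, window_size):
  # Keep the current window in sorted order; add/remove cost O(log k).
  window = SortedList(data[:window_size])
  mid = window_size // 2
  result = []
  for i in range(window_size, len(data) + 1):
    if window_size % 2:
      result.append(window[mid])
    else:
      result.append((window[mid - 1] + window[mid]) / 2)
    if i < len(data):
      window.remove(data[i - window_size])  # element leaving the window
      window.add(data[i])                   # element entering the window
  return result

print(rolling_median([10, 12, 11, 120, 14, 13, 15, 300, 18, 19], 4))
# -> [11.5, 13.0, 13.5, 14.5, 14.5, 16.5, 18.5]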

But not the MAD: at every step the median changes, so the deviation of every item in the sliding window has to be recalculated, making the algorithm complexity at least O(window_size * len(X)).

Is there a more efficient algorithm? And if not, is there an algorithm with a good approximation?

A brute-force example, just for illustration purposes; I'm interested in the general algorithm:

import numpy as np

def median(data):
  return np.median(data)

def median_absolute_deviation(data):
  # MAD: the median of the absolute deviations from the window median.
  m = median(data)
  abs_deviation = [abs(x - m) for x in data]
  return median(abs_deviation)

def rolling_mad(data, window_size):
  # Recomputes the MAD from scratch at every position:
  # O(window_size log window_size) work per step.
  result = []
  for i in range(len(data) - window_size + 1):
    window = data[i:i + window_size]
    result.append(median_absolute_deviation(window))
  return result

data = [10, 12, 11, 120, 14, 13, 15, 300, 18, 19]

rolling_mad_result = rolling_mad(data, window_size=4)
print(rolling_mad_result)

Solution

  • To calculate a rolling median, you maintain a division of the window elements into two groups, those above the median and those below, plus a way to get the smallest or largest element of each group.

    Since each group lies wholly on one side of the median, if a group is sorted by element value, then it is also sorted by absolute deviation from the median; the two groups are just sorted in opposite directions. The above-median group is in order of increasing deviation, while the below-median group is in order of decreasing deviation.
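    A quick illustration of that ordering (the numbers are made up):

    below = [3, 5, 7]     # below-median group, ascending by value
    above = [11, 13, 20]  # above-median group, ascending by value
    m = 9                 # the current median
    print([m - x for x in below])  # [6, 4, 2]   -> deviations decrease
    print([x - m for x in above])  # [2, 4, 11]  -> deviations increase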

    Finding the "median of two sorted arrays with binary search" is a pretty common LeetCode-type question, and we could use that technique here to find the MAD if we keep our two groups sorted. This leads to an algorithm that takes O(log² window_size) per position, since we would have to use a tree or something similar for each group instead of a simple sorted list. That is not bad, but O(log window_size) is possible.
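    For reference, here is the array form of that technique (a sketch; when the groups live in balanced trees, each indexed access below becomes an O(log window_size) rank query, which is where the extra log factor comes from):

    def median_of_two_sorted(a, b):
      # Partition binary search over the shorter array:
      # O(log(min(len(a), len(b)))) iterations.
      if len(a) > len(b):
        a, b = b, a
      total = len(a) + len(b)
      half = total // 2          # size of the combined low half
      lo, hi = 0, len(a)
      while True:
        i = (lo + hi) // 2       # elements of a in the low half
        j = half - i             # elements of b in the low half
        a_left  = a[i - 1] if i > 0 else float('-inf')
        a_right = a[i] if i < len(a) else float('inf')
        b_left  = b[j - 1] if j > 0 else float('-inf')
        b_right = b[j] if j < len(b) else float('inf')
        if a_left <= b_right and b_left <= a_right:   # valid partition
          if total % 2:
            return min(a_right, b_right)
          return (max(a_left, b_left) + min(a_right, b_right)) / 2
        if a_left > b_right:
          hi = i - 1
        else:
          lo = i + 1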

    To do this efficiently, we divide each of our groups into two again. When an element comes into the window, we first assign it to L? or H? according to whether it is below or above the current median, and then to ?L or ?H according to whether its deviation is below or above the current MAD. Our groups in element order are then: LH, LL, HL, HH.
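    The assignment step itself is simple (a sketch; tie-breaking for elements exactly at the median or MAD is left arbitrary here):

    def classify(x, median, mad):
      # First letter: side of the median; second letter: side of the MAD.
      side = 'L' if x < median else 'H'
      band = 'L' if abs(x - median) < mad else 'H'
      return side + band   # one of 'LH', 'LL', 'HL', 'HH'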

    After an element is added to or removed from the window, we need to adjust the sizes of the groups:

    1. If | |LH ∪ LL| - |HL ∪ HH| | > 1, move one element between LL and HL to reestablish the median balance.
    2. While | |LL ∪ HL| - |LH ∪ HH| | > 1, move elements between LH and LL, or between HL and HH, until the MAD balance is reestablished. At most 2 moves are required.

    In step (2), we have a choice of 2 moves to make. We just move the candidate that's closest to the current MAD, i.e., the one that changes the calculated MAD the least.

    We do the same rebalancing steps when an item is removed from the window.
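    The two balance conditions translate directly into code (a sketch, with hypothetical group objects that expose their sizes):

    def median_unbalanced(LH, LL, HL, HH):
      # Condition (1): low side vs. high side of the median.
      return abs((len(LH) + len(LL)) - (len(HL) + len(HH))) > 1

    def mad_unbalanced(LH, LL, HL, HH):
      # Condition (2): small-deviation groups vs. large-deviation groups.
      return abs((len(LL) + len(HL)) - (len(LH) + len(HH))) > 1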

    Finally, to calculate MAD, we first calculate the new median. It's the highest element in LL, the lowest element in HL, or the average of the two, depending on the median balance.

    Then we can calculate the MAD as the highest deviation in LL ∪ HL, the lowest deviation in LH ∪ HH, or the average of the two, depending on the MAD balance.
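    Both reads translate directly to value-sorted groups (a sketch; group names follow the text, and empty-group checks are omitted):

    def read_median(LH, LL, HL, HH):
      # Median balance: compare the low side with the high side.
      low, high = len(LH) + len(LL), len(HL) + len(HH)
      if low > high:
        return LL[-1]                # highest element in LL
      if high > low:
        return HL[0]                 # lowest element in HL
      return (LL[-1] + HL[0]) / 2    # even window: average the two

    def read_mad(m, LH, LL, HL, HH):
      # MAD balance: small-deviation groups vs. large-deviation groups.
      inner, outer = len(LL) + len(HL), len(LH) + len(HH)
      inner_max = max(m - LL[0], HL[-1] - m)   # highest deviation in LL ∪ HL
      outer_min = min(m - LH[-1], HH[0] - m)   # lowest deviation in LH ∪ HH
      if inner > outer:
        return inner_max
      if outer > inner:
        return outer_min
      return (inner_max + outer_min) / 2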

    All that remains is to choose a data structure for the groups. If your language's standard library comes with a balanced binary tree, then that is a good choice. Each of the group operations then takes O(log window_size) time, and there are a bounded number of those per window position.
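    In Python, for instance, the standard library has no balanced binary tree; a common substitute is the third-party sortedcontainers.SortedList (used in the rolling-median sketch above), whose add, remove, and indexed access are approximately logarithmic in practice.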