Search code examples
c++algorithmmeanmedianforward-list

Algorithm for Fast Median Update


Suppose that at a point in time, you have a collection of N numbers and know the median element: M. Now, you're given a new value, X, and so you may need to update M. (Or rather, you will need to, assuming the numbers you're dealing with are all unique. Also, all the samples are received serially, so there's no issues with concurrency.)

Calculating the new mean is straightforward: take the old mean, add X, multiply by N, and divide by N + 1. (This is clear from inspecting how the average of N elements is defined. I'm not too worried about numerics, for the moment.)

My question is: can anyone suggest a creative/novel (or perhaps, provably-optimal) approach the problem of updating the median? I'll provide an example (simple idea of my own design) below, with a bit of analysis:

In this sample, I'll use a std::forward_list, since C++11 is where I encountered this most recently. Without loss of generality, I'm going to assume you're going about this the right way: maintaining an ordered list of the elements (type T's) encountered thus far, std::forward_list<T> sorted; When T x; appears, simply fold it into place using:

sorted.merge(std::forward_list<T> {{ x }});

By the way, I'm curious if anyone has a better (more efficient/elegant) method for this. Gripes welcome.

So, X is now part of sorted, and here's my idea, in a nutshell:

auto it = sorted.begin(), itend = sorted.end();
typename std::forward_list<T>::size_type count = std::distance(it, itend);
for (const auto &e : sorted) {
    if (it == itend || ++it == itend) {
        M = (count % 2) ? e : (e + M) / 2;
        break;
    } else { ++it; }
}

The nice (if not somewhat hard to see) thing that happens here is that: since you move the iterator forward twice (and safely, I might add, though at the price of two comparisons) for each element, when end() is reached, we'll be at the proper (median) value. If there's an odd number of elements, M is just that sample, if not, it's just the average of this element and the old (pushed-out) median. Because odd and even numbers alternate, either the old or new M will actually be in the collection. This reasoning is sound, yes?

You needn't comment on my O(3n) method if you think it's trash/yours is much better; I'm just suggesting it as a starting point.


Solution

  • You could use a std::set, and the fact that insertions to sets won't invalidate iterators.

    You can hold an iterator mIt to the set's median element if N is odd, and to the left of the two median elements, if N is even.

    Let's consider the different cases you can have when you insert elements:

    Insertion when N is odd: if the inserted element is smaller than *mIt, old median becomes right of the two new median elements, so decrement the iterator. If it's bigger (or equal, for multiset), all is well.
    Insertion when N is even: if the inserted element is bigger (or equal) than *mIt, the old right median becomes median, so increment the iterator. If it's smaller, the old left median becomes median, and all is well.

    template <class T>
    class MedianHolder {
      std::set<T> elements;
      std::set<T>::const_iterator mIt;
    
    public:
      T const& getMedian() const { return *mIt; }
    
      void insert(T const& t) {
        if (elements.empty()) {
          mIt = elements.insert(t).first;
          return;
        }
    
        bool smaller = std::less<T>(t,getMedian());
        bool odd = (elements.size() % 2) == 1;
    
        if (!elements.insert(t).second)
          return; //not inserted
    
        if (odd && smaller) --mIt;
        else if (!odd && !smaller) ++mIt;
      }
    };
    

    I'll leave erasing elements as an exercise to you ;-)