Search code examples
algorithmmedian

Convert array into array of medians


I am pondering the solution to the following problem: I have a large array of integers (in even tougher case a stream of integers) and I want to convert that array into and array of medians, tht is it position corresponds to median of array [0..i].

Now, my brute force approach would be for each subarray, sort it and then select middle element. That is O(n^2 log n), as each n subarrays must be sorted in N log N time. I could probably lower the time to N^2 using some counting mechanism, but is N^2 the best that can be done?


Solution

  • Inserting a single item into an already sorted tree-like structure, like e.g. a red-black tree, will take O(log n). If you maintain the number of descendants for every node, then finding the median can be done in O(log n) as well. Do so for all n elements of your stream, and you have an O(n log n) algorithm.

    The data structure and median computation could look something like this:

    struct TreeNode {
      enum { RED, BLACK } color;
      size_t numElements;
      int value;
      TreeNode* left, right;
    } *root;
    
    TreeNode* treeSelect(TreeNode *top, size_t count) {
      if (top->left != NULL) {
        if (count < top->left->numElements)
          return treeSelect(top->left, count)
        count -= top->left->numElements;
      }
      if (count == 0)
        return top;
      return treeSelect(top->right, count - 1);
    }
    
    TreeNode* treeMedian() {
      return treeSelect(root, root->numElements / 2);
    }
    

    The other operations are those usually used for red-black trees, although you can skip those for removal of nodes. You can adjust them to work with duplicate elements. The general rule here is that when an element to be inserted is equal to the element of the current node, you may insert into any child tree. And when balancing the tree, the order of duplicate keys should be maintained, so that you maintain the order of the attached subtrees as well. But unless I miss something, balancing will work without value comparisons in any case, so once you've inserted duplicates, you are done. If you expect really many duplicate values, you might use a multimap-like approach instead, with a counter in each node.