Tags: statistics, iqr, online-algorithm

Welford's online variance algorithm, but for Interquartile Range?


Short Version

Welford's Online Algorithm lets you keep a running value for variance - meaning you don't have to keep all the values (e.g. in a memory constrained system).

Is there something similar for Interquartile Range (IQR)? An online algorithm that lets me know the middle 50% range without having to keep all historical values?

Long Version

Keeping a running average of data when you are memory constrained is pretty easy:

  • Double sum
  • Int64 count

And from this you can compute the mean:

  • mean = sum / count

This allows hours, or years, of observations to quietly be collected, while only taking up 16 bytes.
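
In code the whole accumulator is just this (a minimal sketch; the class and member names are mine, purely for illustration):

class RunningMean {
   Double sum = 0;    // 8 bytes
   Int64 count = 0;   // 8 bytes

   public void Add(Double reading) {
      sum += reading;
      count += 1;
   }

   // Undefined (NaN) until at least one reading has been added.
   public Double Mean { get { return sum / count; } }
}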

Welford's Algorithm for Variance

Normally when you want the variance (or standard deviation), you have to have all your readings on hand, because you have to compute (reading - mean) for every previous reading:

Double sumOfSquaredError = 0;
foreach (Double reading in Readings)
   sumOfSquaredError += (reading - mean) * (reading - mean);   // squared deviation from the mean
Double variance = sumOfSquaredError / count;

Which can add up to terabytes of data over time, rather than taking up something like 16 bytes.

Which is why it was nice when Welford came up with an online algorithm for computing variance of a stream of readings:

It is often useful to be able to compute the variance in a single pass, inspecting each value xi only once; for example, when the data is being collected without enough storage to keep all the values, or when costs of memory access dominate those of computation.

The algorithm for adding a new value to the running variance is:

void AddValue(Double newValue) {
   // Guard against dividing by zero on the very first value.
   Double oldMean = (count > 0) ? sum / count : 0;
   sum += newValue;
   count += 1;
   Double newMean = sum / count;

   if (count > 1)
      variance = ((count - 2) * variance + (newValue - oldMean) * (newValue - newMean)) / (count - 1);
   else
      variance = 0;
}
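
Wrapped into a small self-contained class (my own sketch; the field and method names are mine) it can be used like this:

using System;

class RunningVariance {
   Double sum = 0;
   Int64 count = 0;
   Double variance = 0;   // running sample variance

   public void AddValue(Double newValue) {
      Double oldMean = (count > 0) ? sum / count : 0;
      sum += newValue;
      count += 1;
      Double newMean = sum / count;

      if (count > 1)
         variance = ((count - 2) * variance + (newValue - oldMean) * (newValue - newMean)) / (count - 1);
      else
         variance = 0;
   }

   public Double Variance          { get { return variance; } }
   public Double StandardDeviation { get { return Math.Sqrt(variance); } }
}

class Demo {
   static void Main() {
      var rv = new RunningVariance();
      foreach (Double reading in new[] { 3.1, 2.6, 9.7, 8.4, 3.3 })
         rv.AddValue(reading);

      Console.WriteLine("variance = " + rv.Variance + ", stddev = " + rv.StandardDeviation);
   }
}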

How about an online algorithm for Interquartile Range (IQR)?

Interquartile Range (IQR) is another measure of the spread of data. It tells you how wide the middle 50% of the data is:

(image: illustration of the interquartile range - the middle 50% of the data)

And from that, people then generally draw an IQR box plot:

(image: box plot built from the IQR)

Or at the very least, have the values Q1 and Q3.
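
For contrast, the naive offline calculation needs every reading kept around and sorted - a rough sketch (using simple linear interpolation between order statistics; exact quartile conventions vary):

using System;
using System.Collections.Generic;
using System.Linq;

static class NaiveIqr {
   // p is a fraction between 0 and 1 (0.25 for Q1, 0.75 for Q3).
   public static Double Percentile(IList<Double> values, Double p) {
      var sorted = values.OrderBy(v => v).ToList();
      Double position = p * (sorted.Count - 1);
      int lower = (int)Math.Floor(position);
      int upper = (int)Math.Ceiling(position);
      Double t = position - lower;
      return (1 - t) * sorted[lower] + t * sorted[upper];
   }

   // IQR = Q3 - Q1; note this needs the full list of readings in memory.
   public static Double InterquartileRange(IList<Double> values) {
      return Percentile(values, 0.75) - Percentile(values, 0.25);
   }
}

That full list of readings is exactly what I'm trying to avoid keeping.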

Is there a way to calculate the Interquartile Range without having to keep all the recorded values?

In other words:

Is there something like Welford's online variance algorithm, but for Interquartile Range?

Knuth, Seminumerical Algorithms

You can find Welford's algorithm explained in Knuth's 2nd volume Seminumerical Algorithms:

(image: excerpt from Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms)

(just in case anyone thought this isn't computer science or programming related)



Solution

  • There's a useful paper by Ben-Haim and Tom-Tov, published in 2010 in the Journal of Machine Learning Research:

    A Streaming Parallel Decision Tree Algorithm

    It describes an algorithm to automatically create a histogram from online (streaming) data that does not require unbounded memory:

    • you add a value to the history
    • the algorithm dynamically creates buckets
    • including bucket sizes

    The paper is kind of dense (as all math papers are), but the algorithm is fairly simple.

    Let's start with some sample data. For this answer I'll use the digits of pi as the source of incoming floating-point numbers:

    Value
    3.14159
    2.65358
    9.79323
    8.46264
    3.38327
    9.50288
    4.19716
    9.39937
    5.10582
    0.97494
    4.59230
    7.81640
    6.28620
    8.99862
    8.03482
    5.34211
    ...

    I will say that I want 5 bins in my histogram.

    We add the first value (3.14159), which causes the first bin to be created:

    Bin Count
    3.14159 1

    The bins in this histogram don't have any width; each is purely a point:

    (image: histogram with a single point-bin at 3.14159)

    And then we add the 2nd value (2.65358) to the histogram:

    (image: histogram after adding 2.65358)

    And we continue adding points, until we reach our arbitrary limit of 5 "buckets":

    (image: histogram with all 5 buckets filled)

    That is all 5 buckets filled.

    We add our 6th value (9.50288) to the histogram, except that means we now have 6 buckets; but we decided we only want five:

    (image: histogram with a temporary 6th bucket)

    Now is where the magic starts

    In order to do the streaming part, and limit memory usage to less-than-infinity, we need to merge some of the "bins".

    Look at each pair of neighboring bins, left to right, and see which two are closest together. In our case it is these two buckets:

    (image: the two closest buckets highlighted)

    These two buckets are merged and given a new x value that depends on their relative heights (i.e. counts):

    • y_new = y_left + y_right = 1 + 1 = 2
    • x_new = x_left × (y_left / y_new) + x_right × (y_right / y_new) = 3.14159 × (1/2) + 3.38327 × (1/2) = 3.26243

    (image: histogram after merging the two closest buckets)

    And now we repeat.

    • add a new value

    (image: histogram after adding the next value)

    • merge the two neighboring buckets that are closest to each other
    • deciding on the new x position based on their relative heights

    (image: histogram after merging again)

    Eventually giving you (although I screwed it up a little, as I was doing it manually in Excel for this answer):

    (image: the resulting 5-bucket histogram)
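
    A rough sketch of that add-then-merge step as I understand the paper (my own class and field names, not the paper's pseudocode):

    using System;
    using System.Collections.Generic;

    class StreamingHistogram {
       readonly int maxBins;
       // Each bin is a centroid (x position) and a count, kept sorted by x.
       readonly List<(Double X, Int64 Count)> bins = new List<(Double X, Int64 Count)>();

       public StreamingHistogram(int maxBins) { this.maxBins = maxBins; }

       public void Add(Double value) {
          // Insert the new value as its own zero-width bin, keeping the list sorted by x.
          int i = bins.FindIndex(b => b.X >= value);
          if (i >= 0 && bins[i].X == value)
             bins[i] = (value, bins[i].Count + 1);              // exact duplicate: just bump the count
          else
             bins.Insert(i < 0 ? bins.Count : i, (value, 1));

          // Over the limit? Merge the two neighbouring bins that are closest together.
          if (bins.Count > maxBins) {
             int m = 0;
             for (int j = 1; j < bins.Count - 1; j++)
                if (bins[j + 1].X - bins[j].X < bins[m + 1].X - bins[m].X)
                   m = j;

             Int64 newCount = bins[m].Count + bins[m + 1].Count;
             // New centroid = count-weighted average of the two old positions.
             Double newX = (bins[m].X * bins[m].Count + bins[m + 1].X * bins[m + 1].Count) / newCount;
             bins[m] = (newX, newCount);
             bins.RemoveAt(m + 1);
          }
       }

       public IReadOnlyList<(Double X, Int64 Count)> Bins { get { return bins; } }
    }

    Feeding this the digits of pi above with maxBins = 5 should walk through the same merges as the pictures.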

    Practical Example

    I wanted a histogram of 20 buckets, which lets me extract some useful statistics. For a histogram of 11 buckets containing 38,000 data points, only 40 bytes of memory are required:

    (image: the resulting histogram)

    With these 20 buckets, I can now compute the Probability Density Function (PDF):

    Bin Count PDF
    2.113262834 3085 5.27630%
    6.091181608 3738 6.39313%
    10.13907062 4441 7.59548%
    14.38268188 5506 9.41696%
    18.92107481 6260 10.70653%
    23.52148965 6422 10.98360%
    28.07685659 5972 10.21396%
    32.55801082 5400 9.23566%
    36.93292359 4604 7.87426%
    41.23715698 3685 6.30249%
    45.62006198 3136 5.36353%
    50.38765223 2501 4.27748%
    55.34957161 1618 2.76728%
    60.37095192 989 1.69149%
    65.99939004 613 1.04842%
    71.73292736 305 0.52164%
    78.18427775 140 0.23944%
    85.22261376 38 0.06499%
    90.13115876 12 0.02052%
    96.1987941 4 0.00684%

    (image: PDF chart)

    And with the PDF, you can now calculate the Expected Value (i.e. mean):

    Bin Count PDF EV
    2.113262834 3085 5.27630% 0.111502092
    6.091181608 3738 6.39313% 0.389417244
    10.13907062 4441 7.59548% 0.770110873
    14.38268188 5506 9.41696% 1.354410824
    18.92107481 6260 10.70653% 2.025790219
    23.52148965 6422 10.98360% 2.583505901
    28.07685659 5972 10.21396% 2.86775877
    32.55801082 5400 9.23566% 3.00694827
    36.93292359 4604 7.87426% 2.908193747
    41.23715698 3685 6.30249% 2.598965665
    45.62006198 3136 5.36353% 2.446843872
    50.38765223 2501 4.27748% 2.155321935
    55.34957161 1618 2.76728% 1.531676732
    60.37095192 989 1.69149% 1.021171415
    65.99939004 613 1.04842% 0.691950026
    71.73292736 305 0.52164% 0.374190474
    78.18427775 140 0.23944% 0.187206877
    85.22261376 38 0.06499% 0.05538763
    90.13115876 12 0.02052% 0.018498245
    96.1987941 4 0.00684% 0.006581183

    Which gives:

    • Expected Value: 27.10543

    (image: PDF chart with the expected value marked)
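
    Those two columns are just a weighted sum over the bins. A hedged sketch, assuming the bins are (centroid, count) pairs as in the sketch above:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class HistogramStats {
       // PDF of a bin = its count / total count;
       // Expected Value = sum over bins of (centroid × that bin's PDF).
       public static Double ExpectedValue(IReadOnlyList<(Double X, Int64 Count)> bins) {
          Double total = bins.Sum(b => (Double)b.Count);
          Double ev = 0;
          foreach (var b in bins)
             ev += b.X * (b.Count / total);
          return ev;
       }
    }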

    Cumulative Distribution Function (CDF)

    We can now also get the Cumulative Distribution Function (CDF):

    Bin Count PDF EV CDF
    2.113262834 3085 5.27630% 0.11150 5.27630%
    6.091181608 3738 6.39313% 0.38942 11.66943%
    10.13907062 4441 7.59548% 0.77011 19.26491%
    14.38268188 5506 9.41696% 1.35441 28.68187%
    18.92107481 6260 10.70653% 2.02579 39.38839%
    23.52148965 6422 10.98360% 2.58351 50.37199%
    28.07685659 5972 10.21396% 2.86776 60.58595%
    32.55801082 5400 9.23566% 3.00695 69.82161%
    36.93292359 4604 7.87426% 2.90819 77.69587%
    41.23715698 3685 6.30249% 2.59897 83.99836%
    45.62006198 3136 5.36353% 2.44684 89.36188%
    50.38765223 2501 4.27748% 2.15532 93.63936%
    55.34957161 1618 2.76728% 1.53168 96.40664%
    60.37095192 989 1.69149% 1.02117 98.09814%
    65.99939004 613 1.04842% 0.69195 99.14656%
    71.73292736 305 0.52164% 0.37419 99.66820%
    78.18427775 140 0.23944% 0.18721 99.90764%
    85.22261376 38 0.06499% 0.05539 99.97264%
    90.13115876 12 0.02052% 0.01850 99.99316%
    96.1987941 4 0.00684% 0.00658 100.00000%

    (image: CDF chart)

    And the CDF is where we can start to get the values I want.

    The median (50th percentile) is where the CDF reaches 50%:

    (image: CDF chart with the median marked)

    From interpolation of the data, we can find the x value where the CDF is 50%:

    Bin Count PDF EV CDF
    18.92107481 6260 10.70653% 2.02579 39.38839%
    23.52148965 6422 10.98360% 2.58351 50.37199%
    • t = (50 - 39.38839) / (50.37199 - 39.38839) = 10.61161 / 10.98360 = 0.96613
    • x_median = (1 - t) × 18.92107481 + t × 23.52148965 = 23.366

    So now we know:

    • Expected Value (mean): 27.10543
    • Median: 23.366

    My original ask was the IQR - the x values that span the middle 50% of the data, from 25% to 75%. Once again we can interpolate the CDF:

    Bin Count PDF EV CDF
    10.13907062 4441 7.59548% 0.77011 19.26491%
    12.7235 25.00000%
    14.38268188 5506 9.41696% 1.35441 28.68187%
    23.366 50.00000% (median)
    23.52148965 6422 10.98360% 2.58351 50.37199% (mode)
    27.10543 mean
    28.07685659 5972 10.21396% 2.86776 60.58595%
    32.55801082 5400 9.23566% 3.00695 69.82161%
    35.4351 75.00000%
    36.93292359 4604 7.87426% 2.90819 77.69587%

    (image: CDF chart with Q1, the median, and Q3 marked)
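
    And a sketch of that CDF interpolation as a reusable helper (same caveats: my own code, and interpolating between bin centroids only approximates the true percentiles):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class HistogramPercentiles {
       // Returns the x value at which the cumulative fraction reaches p (0..1),
       // by linear interpolation between neighbouring bin centroids.
       public static Double Percentile(IReadOnlyList<(Double X, Int64 Count)> bins, Double p) {
          Double total = bins.Sum(b => (Double)b.Count);
          Double target = p * total;

          Double cumulative = 0;
          Double prevX = bins[0].X;
          Double prevCumulative = 0;

          foreach (var b in bins) {
             cumulative += b.Count;
             if (cumulative >= target) {
                Double t = (target - prevCumulative) / (cumulative - prevCumulative);
                return (1 - t) * prevX + t * b.X;
             }
             prevX = b.X;
             prevCumulative = cumulative;
          }
          return bins[bins.Count - 1].X;
       }

       // Q1 and Q3 - the Interquartile Range endpoints. The same call gives the other
       // spans listed below, e.g. the middle 90% runs from Percentile(bins, 0.05) to Percentile(bins, 0.95).
       public static (Double Q1, Double Q3) Quartiles(IReadOnlyList<(Double X, Int64 Count)> bins) {
          return (Percentile(bins, 0.25), Percentile(bins, 0.75));
       }
    }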

    This can be continued to get other useful stats:

    • Mean (aka average, expected value)
    • Median (50%)
    • Middle quintile (middle 20%)
    • IQR (middle 50% range)
    • middle 3 quintiles (middle 60%)
    • 1 standard deviation range (middle 68.26%)
    • middle 80%
    • middle 90%
    • middle 95%
    • 2 standard deviations range (middle 95.45%)
    • middle 99%
    • 3 standard deviations range (middle 99.74%)
