Welford's Online Algorithm lets you keep a running value for the variance, meaning you don't have to keep all the values (e.g. in a memory-constrained system).
Is there something similar for Interquartile Range (IQR)? An online algorithm that lets me know the middle 50% range without having to keep all historical values?
Keeping a running average of data, where you are memory constrained, is pretty easy:
```
Double sum;
Int64 count;
```
And from this you can compute the mean:
```
mean = sum / count
```
This allows hours, or years, of observations to quietly be collected, while only taking up 16 bytes.
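A minimal sketch of that accumulator, wrapping the same two fields (the class name is just for illustration):

```
class RunningMean {
    private Double sum;
    private Int64 count;

    public void Add(Double reading) {
        sum += reading;
        count += 1;
    }

    public Double Mean => sum / count;   // NaN until the first reading has been added
}
```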
Normally when you want the variance (or standard deviation), you have to have all your readings on hand, because you have to compute (reading - mean) for every previous reading:
```
Double sumOfSquaredError = 0;
foreach (Double reading in Readings)
    sumOfSquaredError += (reading - mean) * (reading - mean);
Double variance = sumOfSquaredError / count;
```
Which can add up to terabytes of data over time, rather than taking up something like 16 bytes.
Which is why it was nice when Welford came up with an online algorithm for computing variance of a stream of readings:
It is often useful to be able to compute the variance in a single pass, inspecting each value xi only once; for example, when the data is being collected without enough storage to keep all the values, or when costs of memory access dominate those of computation.
The algorithm for adding a new value to the running variance is:
```
void addValue(Double newValue) {
    Double oldMean = sum / count;   // NaN on the very first value, but unused in that case
    sum += newValue;
    count += 1;
    Double newMean = sum / count;

    if (count > 1)
        // Incremental update: no previous readings are needed.
        variance = ((count - 2) * variance + (newValue - oldMean) * (newValue - newMean)) / (count - 1);
    else
        variance = 0;
}
```
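For completeness, here is the same update wrapped in a self-contained accumulator (the class and member names are just illustrative, not from Welford or Knuth):

```
class RunningVariance {
    private Double sum;
    private Int64 count;
    private Double variance;

    public void AddValue(Double newValue) {
        Double oldMean = count > 0 ? sum / count : 0;
        sum += newValue;
        count += 1;
        Double newMean = sum / count;

        // Incremental update of the sample variance; no readings are stored.
        variance = count > 1
            ? ((count - 2) * variance + (newValue - oldMean) * (newValue - newMean)) / (count - 1)
            : 0;
    }

    public Double Mean => sum / count;
    public Double Variance => variance;
    public Double StandardDeviation => Math.Sqrt(variance);
}
```

Only sum, count, and variance are kept, so the memory use stays constant no matter how many readings are fed in.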
The Interquartile Range (IQR) is another method of getting the spread of data. It tells you how wide the middle 50% of the data is:
And from that people then generally draw an IQR box plot:
Or at the very least, have the values Q1 and Q3.
Is there a way to calculate the Interquartile Range without having to keep all the recorded values?
In other words:
Is there something like Welford's online variance algorithm, but for Interquartile Range?
You can find Welford's algorithm explained in Knuth's The Art of Computer Programming, Volume 2: Seminumerical Algorithms:
(just in case anyone thought this isn't computer science or programming related)
There's a useful paper by Ben-Haim and Tom-Tov, published in 2010 in the Journal of Machine Learning Research:
A Streaming Parallel Decision Tree Algorithm
It describes an algorithm for automatically building a histogram from online (streaming) data, without requiring unbounded memory.
The paper is kind of dense (as all math papers are), but the algorithm is fairly simple.
Let's start with some sample data. For this answer I'll use the digits of pi as the source of incoming floating-point numbers:
Value |
---|
3.14159 |
2.65358 |
9.79323 |
8.46264 |
3.38327 |
9.50288 |
4.19716 |
9.39937 |
5.10582 |
0.97494 |
4.59230 |
7.81640 |
6.28620 |
8.99862 |
8.03482 |
5.34211 |
... |
I will define that I want 5 bins in my histogram.
We add the first value (3.14159), which causes the first bin to be created:
Bin | Count |
---|---|
3.14159 | 1 |
The bins in this histogram don't have any width; they are purely a point:
And then we add the 2nd value (2.65358) to the histogram:
And we continue adding points, until we reach our arbitrary limit of 5 "buckets":
That is all 5 buckets filled.
We add our 6th value (9.50288) to the histogram, except that means we now have 6 buckets; but we decided we only want five:
In order to do the streaming part, and limit memory usage to less-than-infinity, we need to merge some of the "bins".
Look at each adjacent pair of bins, left to right, and see which two are closest together. In our case it is these two buckets:
These two buckets are merged into one, and given a new x value that is the average of the two old values weighted by their relative heights (i.e. counts):
And now we repeat.
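Here is a rough sketch of that add-then-merge loop in code. This is my own simplified reading of the paper's update procedure, not its exact pseudocode; the class name, the List-based storage, and the tuple bins are all just illustrative choices:

```
using System;
using System.Collections.Generic;

class StreamingHistogram {
    // Each bin is a centre point plus a count; the list is kept sorted by centre.
    private readonly List<(Double Centre, Int64 Count)> bins = new List<(Double Centre, Int64 Count)>();
    private readonly int maxBins;

    public StreamingHistogram(int maxBins) { this.maxBins = maxBins; }

    public void Add(Double value) {
        // 1. Add the incoming value as its own bin of count 1 (or bump an exact match).
        int i = bins.FindIndex(b => b.Centre >= value);
        if (i >= 0 && bins[i].Centre == value)
            bins[i] = (value, bins[i].Count + 1);
        else
            bins.Insert(i < 0 ? bins.Count : i, (value, 1));

        // 2. If that pushed us over the limit, merge the two closest neighbouring bins.
        if (bins.Count > maxBins) {
            int m = 0;
            for (int j = 1; j < bins.Count - 1; j++)
                if (bins[j + 1].Centre - bins[j].Centre < bins[m + 1].Centre - bins[m].Centre)
                    m = j;

            var (c1, n1) = bins[m];
            var (c2, n2) = bins[m + 1];
            // The merged bin's centre is the count-weighted average of the two old centres.
            bins[m] = ((c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2);
            bins.RemoveAt(m + 1);
        }
    }

    public IReadOnlyList<(Double Centre, Int64 Count)> Bins => bins;
}
```

The memory use is fixed by maxBins, not by how many values have streamed through, which is what makes it viable for the years-of-observations scenario.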
Eventually giving you (although I screwed it up as I was doing it manually in Excel for this answer):
I wanted a histogram of 20 buckets, which lets me extract some useful statistics. Even a histogram of 11 buckets, containing 38,000 data points, only requires 40 bytes of memory:
With these 20 buckets, I can now compute the Probability Density Function (PDF):
Bin | Count | PDF |
---|---|---|
2.113262834 | 3085 | 5.27630% |
6.091181608 | 3738 | 6.39313% |
10.13907062 | 4441 | 7.59548% |
14.38268188 | 5506 | 9.41696% |
18.92107481 | 6260 | 10.70653% |
23.52148965 | 6422 | 10.98360% |
28.07685659 | 5972 | 10.21396% |
32.55801082 | 5400 | 9.23566% |
36.93292359 | 4604 | 7.87426% |
41.23715698 | 3685 | 6.30249% |
45.62006198 | 3136 | 5.36353% |
50.38765223 | 2501 | 4.27748% |
55.34957161 | 1618 | 2.76728% |
60.37095192 | 989 | 1.69149% |
65.99939004 | 613 | 1.04842% |
71.73292736 | 305 | 0.52164% |
78.18427775 | 140 | 0.23944% |
85.22261376 | 38 | 0.06499% |
90.13115876 | 12 | 0.02052% |
96.1987941 | 4 | 0.00684% |
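The PDF column is simply each bin's count divided by the total count. A tiny sketch of that, using the hypothetical StreamingHistogram above (plus System.Linq):

```
// histogram is a StreamingHistogram instance that has had every reading Add()-ed to it.
Int64 total = histogram.Bins.Sum(b => b.Count);
var pdf = histogram.Bins
    .Select(b => (b.Centre, Probability: b.Count / (Double)total))
    .ToList();
```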
And with the PDF, you can now calculate the Expected Value (i.e. mean):
Bin | Count | PDF | EV |
---|---|---|---|
2.113262834 | 3085 | 5.27630% | 0.111502092 |
6.091181608 | 3738 | 6.39313% | 0.389417244 |
10.13907062 | 4441 | 7.59548% | 0.770110873 |
14.38268188 | 5506 | 9.41696% | 1.354410824 |
18.92107481 | 6260 | 10.70653% | 2.025790219 |
23.52148965 | 6422 | 10.98360% | 2.583505901 |
28.07685659 | 5972 | 10.21396% | 2.86775877 |
32.55801082 | 5400 | 9.23566% | 3.00694827 |
36.93292359 | 4604 | 7.87426% | 2.908193747 |
41.23715698 | 3685 | 6.30249% | 2.598965665 |
45.62006198 | 3136 | 5.36353% | 2.446843872 |
50.38765223 | 2501 | 4.27748% | 2.155321935 |
55.34957161 | 1618 | 2.76728% | 1.531676732 |
60.37095192 | 989 | 1.69149% | 1.021171415 |
65.99939004 | 613 | 1.04842% | 0.691950026 |
71.73292736 | 305 | 0.52164% | 0.374190474 |
78.18427775 | 140 | 0.23944% | 0.187206877 |
85.22261376 | 38 | 0.06499% | 0.05538763 |
90.13115876 | 12 | 0.02052% | 0.018498245 |
96.1987941 | 4 | 0.00684% | 0.006581183 |
Which gives:
27.10543
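That number is just the probability-weighted sum of the bin centres: each EV cell above is Bin × PDF, and the mean is their total. Continuing the sketch:

```
// Expected value (mean): sum of centre * probability over all bins.
Double mean = pdf.Sum(b => b.Centre * b.Probability);   // ≈ 27.10543 for the table above
```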
We can now also get the Cumulative Distribution Function (CDF):
Bin | Count | PDF | EV | CDF |
---|---|---|---|---|
2.113262834 | 3085 | 5.27630% | 0.11150 | 5.27630% |
6.091181608 | 3738 | 6.39313% | 0.38942 | 11.66943% |
10.13907062 | 4441 | 7.59548% | 0.77011 | 19.26491% |
14.38268188 | 5506 | 9.41696% | 1.35441 | 28.68187% |
18.92107481 | 6260 | 10.70653% | 2.02579 | 39.38839% |
23.52148965 | 6422 | 10.98360% | 2.58351 | 50.37199% |
28.07685659 | 5972 | 10.21396% | 2.86776 | 60.58595% |
32.55801082 | 5400 | 9.23566% | 3.00695 | 69.82161% |
36.93292359 | 4604 | 7.87426% | 2.90819 | 77.69587% |
41.23715698 | 3685 | 6.30249% | 2.59897 | 83.99836% |
45.62006198 | 3136 | 5.36353% | 2.44684 | 89.36188% |
50.38765223 | 2501 | 4.27748% | 2.15532 | 93.63936% |
55.34957161 | 1618 | 2.76728% | 1.53168 | 96.40664% |
60.37095192 | 989 | 1.69149% | 1.02117 | 98.09814% |
65.99939004 | 613 | 1.04842% | 0.69195 | 99.14656% |
71.73292736 | 305 | 0.52164% | 0.37419 | 99.66820% |
78.18427775 | 140 | 0.23944% | 0.18721 | 99.90764% |
85.22261376 | 38 | 0.06499% | 0.05539 | 99.97264% |
90.13115876 | 12 | 0.02052% | 0.01850 | 99.99316% |
96.1987941 | 4 | 0.00684% | 0.00658 | 100.00000% |
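The CDF column is just a running total of the PDF column. Continuing the same sketch:

```
// CDF: cumulative fraction of observations at or below each bin centre.
var cdf = new List<(Double Centre, Double Cumulative)>();
Double running = 0;
foreach (var (centre, probability) in pdf) {
    running += probability;
    cdf.Add((centre, running));
}
```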
And the CDF is where we can start to get the values I want.
The median (50th percentile), where the CDF reaches 50%:
By interpolating between the two rows of the data that straddle 50%, we can find the x value where the CDF is 50%:
Bin | Count | PDF | EV | CDF |
---|---|---|---|---|
18.92107481 | 6260 | 10.70653% | 2.02579 | 39.38839% |
23.52148965 | 6422 | 10.98360% | 2.58351 | 50.37199% |
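Straight-line interpolation between those two rows gives the median:

```
// Interpolate between the two CDF rows that straddle 50%.
Double median = 18.92107481
    + (0.50 - 0.3938839) / (0.5037199 - 0.3938839)
    * (23.52148965 - 18.92107481);   // ≈ 23.366
```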
So now we know:
mean: 27.10543
median: 23.366
My original ask was the IQR: the x values that bound the middle 50% of the data, i.e. the 25th and 75th percentiles (Q1 and Q3). Once again we can interpolate the CDF:
Bin | Count | PDF | EV | CDF |
---|---|---|---|---|
10.13907062 | 4441 | 7.59548% | 0.77011 | 19.26491% |
12.7235 | | | | 25.00000% (Q1) |
14.38268188 | 5506 | 9.41696% | 1.35441 | 28.68187% |
23.366 | | | | 50.00000% (median) |
23.52148965 | 6422 | 10.98360% | 2.58351 | 50.37199% (mode) |
27.10543 (mean) | | | | |
28.07685659 | 5972 | 10.21396% | 2.86776 | 60.58595% |
32.55801082 | 5400 | 9.23566% | 3.00695 | 69.82161% |
35.4351 | | | | 75.00000% (Q3) |
36.93292359 | 4604 | 7.87426% | 2.90819 | 77.69587% |
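The same straight-line interpolation, wrapped in a small helper against the cdf list sketched earlier, gives Q1 and Q3 directly, and with them the IQR itself (the helper name and the treatment of the ends are my own choices):

```
// Find the x value at which the CDF crosses fraction p (0..1), by straight-line
// interpolation between the two bins that straddle it.
static Double Percentile(List<(Double Centre, Double Cumulative)> cdf, Double p) {
    if (p <= cdf[0].Cumulative)
        return cdf[0].Centre;
    for (int i = 1; i < cdf.Count; i++) {
        if (cdf[i].Cumulative >= p) {
            var (x0, c0) = cdf[i - 1];
            var (x1, c1) = cdf[i];
            return x0 + (p - c0) / (c1 - c0) * (x1 - x0);
        }
    }
    return cdf[cdf.Count - 1].Centre;
}

Double q1  = Percentile(cdf, 0.25);   // ≈ 12.72
Double q3  = Percentile(cdf, 0.75);   // ≈ 35.44
Double iqr = q3 - q1;                 // ≈ 22.71, the interquartile range, with no raw readings kept
```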
This can be continued to get other useful stats: