Tags: statistics, iqr, online-algorithm

Welford's online variance algorithm, but for Interquartile Range?


Short Version

Welford's Online Algorithm lets you keep a running value for variance - meaning you don't have to keep all the values (e.g. in a memory constrained system).

Is there something similar for Interquartile Range (IQR)? An online algorithm that lets me know the middle 50% range without having to keep all historical values?

Long Version

Keeping a running average of data when you are memory constrained is pretty easy:

  • Double sum
  • Int64 count

And from this you can compute the mean:

  • mean = sum / count

This allows hours, or years, of observations to quietly be collected, while only taking up 16 bytes.
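
In code the whole accumulator is just this (a minimal sketch; the class and member names are mine, purely for illustration):

class RunningMean {
   Double sum = 0;    // 8 bytes
   Int64 count = 0;   // 8 bytes

   public void Add(Double reading) {
      sum += reading;
      count += 1;
   }

   // Undefined (NaN) until at least one reading has been added.
   public Double Mean { get { return sum / count; } }
}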

Welford's Algorithm for Variance

Normally when you want the variance (or standard deviation), you have to have all your readings on hand, because you have to compute (reading - mean) for every previous reading:

Double sumOfSquaredError = 0;
foreach (Double reading in Readings)
   sumOfSquaredError += (reading - mean) * (reading - mean);   // squared deviation from the mean
Double variance = sumOfSquaredError / count;

Which can add up to terabytes of data over time, rather than taking up something like 16 bytes.

Which is why it was nice when Welford came up with an online algorithm for computing variance of a stream of readings:

It is often useful to be able to compute the variance in a single pass, inspecting each value xi only once; for example, when the data is being collected without enough storage to keep all the values, or when costs of memory access dominate those of computation.

The algorithm for adding a new value to the running variance is:

void AddValue(Double newValue) {
   // Guard against dividing by zero on the very first value.
   Double oldMean = (count > 0) ? sum / count : 0;
   sum += newValue;
   count += 1;
   Double newMean = sum / count;

   if (count > 1)
      variance = ((count - 2) * variance + (newValue - oldMean) * (newValue - newMean)) / (count - 1);
   else
      variance = 0;
}
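
Wrapped into a small self-contained class (my own sketch; the field and method names are mine) it can be used like this:

using System;

class RunningVariance {
   Double sum = 0;
   Int64 count = 0;
   Double variance = 0;   // running sample variance

   public void AddValue(Double newValue) {
      Double oldMean = (count > 0) ? sum / count : 0;
      sum += newValue;
      count += 1;
      Double newMean = sum / count;

      if (count > 1)
         variance = ((count - 2) * variance + (newValue - oldMean) * (newValue - newMean)) / (count - 1);
      else
         variance = 0;
   }

   public Double Variance          { get { return variance; } }
   public Double StandardDeviation { get { return Math.Sqrt(variance); } }
}

class Demo {
   static void Main() {
      var rv = new RunningVariance();
      foreach (Double reading in new[] { 3.1, 2.6, 9.7, 8.4, 3.3 })
         rv.AddValue(reading);

      Console.WriteLine("variance = " + rv.Variance + ", stddev = " + rv.StandardDeviation);
   }
}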

How about an online algorithm for Interquartile Range (IQR)?

Interquartile Range (IQR) is another measure of the spread of data. It tells you how wide the middle 50% of the data is:

(image: illustration of the interquartile range - the middle 50% of the data)

And from that, people then generally draw an IQR box plot:

(image: box plot built from the IQR)

Or at the very least, have the values Q1 and Q3.
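
For contrast, the naive offline calculation needs every reading kept around and sorted - a rough sketch (using simple linear interpolation between order statistics; exact quartile conventions vary):

using System;
using System.Collections.Generic;
using System.Linq;

static class NaiveIqr {
   // p is a fraction between 0 and 1 (0.25 for Q1, 0.75 for Q3).
   public static Double Percentile(IList<Double> values, Double p) {
      var sorted = values.OrderBy(v => v).ToList();
      Double position = p * (sorted.Count - 1);
      int lower = (int)Math.Floor(position);
      int upper = (int)Math.Ceiling(position);
      Double t = position - lower;
      return (1 - t) * sorted[lower] + t * sorted[upper];
   }

   // IQR = Q3 - Q1; note this needs the full list of readings in memory.
   public static Double InterquartileRange(IList<Double> values) {
      return Percentile(values, 0.75) - Percentile(values, 0.25);
   }
}

That full list of readings is exactly what I'm trying to avoid keeping.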

Is there a way to calculate the Interquartile Range without having to keep all the recorded values?

In other words:

Is there something like Welford's online variance algorithm, but for Interquartile Range?

Knuth, Seminumerical Algorithms

You can find Welford's algorithm explained in Knuth's 2nd volume Seminumerical Algorithms:

(image: excerpt from Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms)

(just in case anyone thought this isn't computer science or programming related)



Solution

  • There's a useful paper by Ben-Haim and Tom-Tov, published in 2010 in the Journal of Machine Learning Research:

    A Streaming Parallel Decision Tree Algorithm

    It describes an algorithm to automatically create a histogram from online (streaming) data that does not require unbounded memory:

    • you add a value to the history
    • the algorithm dynamically creates buckets
    • including bucket sizes

    The paper is kind of dense (as all math papers are), but the algorithm is fairly simple.

    Let's start with some sample data. For this answer I'll use the digits of pi as the source of incoming floating-point numbers:

    Value
    3.14159
    2.65358
    9.79323
    8.46264
    3.38327
    9.50288
    4.19716
    9.39937
    5.10582
    0.97494
    4.59230
    7.81640
    6.28620
    8.99862
    8.03482
    5.34211
    ...

    I will say that I want 5 bins in my histogram.

    We add the first value (3.14159), which causes the first bin to be created:

    Bin Count
    3.14159 1

    The bins in this histogram don't have any width; each is purely a point:

    (image: histogram with a single point-bin at 3.14159)

    And then we add the 2nd value (2.65358) to the histogram:

    (image: histogram after adding 2.65358)

    And we continue adding points, until we reach our arbitrary limit of 5 "buckets":

    (image: histogram with all 5 buckets filled)

    That is all 5 buckets filled.

    We add our 6th value (9.50288) to the histogram, except that means we now have 6 buckets; but we decided we only want five:

    (image: histogram with a temporary 6th bucket)

    Now is where the magic starts

    In order to do the streaming part, and limit memory usage to less-than-infinity, we need to merge some of the "bins".

    Look at each pair of neighboring bins, left to right, and see which two are closest together. In our case it is these two buckets:

    (image: the two closest buckets highlighted)

    These two buckets are merged and given a new x value that depends on their relative heights (i.e. counts):

    • y_new = y_left + y_right = 1 + 1 = 2
    • x_new = x_left × (y_left / y_new) + x_right × (y_right / y_new) = 3.14159 × (1/2) + 3.38327 × (1/2) = 3.26243

    (image: histogram after merging the two closest buckets)

    And now we repeat.

    • add a new value

    (image: histogram after adding the next value)

    • merge the two neighboring buckets that are closest to each other
    • deciding on the new x position based on their relative heights

    (image: histogram after merging again)

    Eventually giving you (although I screwed it up a little, as I was doing it manually in Excel for this answer):

    (image: the resulting 5-bucket histogram)
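
    A rough sketch of that add-then-merge step as I understand the paper (my own class and field names, not the paper's pseudocode):

    using System;
    using System.Collections.Generic;

    class StreamingHistogram {
       readonly int maxBins;
       // Each bin is a centroid (x position) and a count, kept sorted by x.
       readonly List<(Double X, Int64 Count)> bins = new List<(Double X, Int64 Count)>();

       public StreamingHistogram(int maxBins) { this.maxBins = maxBins; }

       public void Add(Double value) {
          // Insert the new value as its own zero-width bin, keeping the list sorted by x.
          int i = bins.FindIndex(b => b.X >= value);
          if (i >= 0 && bins[i].X == value)
             bins[i] = (value, bins[i].Count + 1);              // exact duplicate: just bump the count
          else
             bins.Insert(i < 0 ? bins.Count : i, (value, 1));

          // Over the limit? Merge the two neighbouring bins that are closest together.
          if (bins.Count > maxBins) {
             int m = 0;
             for (int j = 1; j < bins.Count - 1; j++)
                if (bins[j + 1].X - bins[j].X < bins[m + 1].X - bins[m].X)
                   m = j;

             Int64 newCount = bins[m].Count + bins[m + 1].Count;
             // New centroid = count-weighted average of the two old positions.
             Double newX = (bins[m].X * bins[m].Count + bins[m + 1].X * bins[m + 1].Count) / newCount;
             bins[m] = (newX, newCount);
             bins.RemoveAt(m + 1);
          }
       }

       public IReadOnlyList<(Double X, Int64 Count)> Bins { get { return bins; } }
    }

    Feeding this the digits of pi above with maxBins = 5 should walk through the same merges as the pictures.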

    Practical Example

    I wanted a histogram of 20 buckets, which lets me extract some useful statistics. For a histogram of 11 buckets containing 38,000 data points, only 40 bytes of memory are required:

    (image: the resulting histogram)

    With these 20 buckets, I can now compute the Probability Density Function (PDF):

    Bin Count PDF
    2.113262834 3085 5.27630%
    6.091181608 3738 6.39313%
    10.13907062 4441 7.59548%
    14.38268188 5506 9.41696%
    18.92107481 6260 10.70653%
    23.52148965 6422 10.98360%
    28.07685659 5972 10.21396%
    32.55801082 5400 9.23566%
    36.93292359 4604 7.87426%
    41.23715698 3685 6.30249%
    45.62006198 3136 5.36353%
    50.38765223 2501 4.27748%
    55.34957161 1618 2.76728%
    60.37095192 989 1.69149%
    65.99939004 613 1.04842%
    71.73292736 305 0.52164%
    78.18427775 140 0.23944%
    85.22261376 38 0.06499%
    90.13115876 12 0.02052%
    96.1987941 4 0.00684%

    (image: PDF chart)

    And with the PDF, you can now calculate the Expected Value (i.e. mean):

    Bin Count PDF EV
    2.113262834 3085 5.27630% 0.111502092
    6.091181608 3738 6.39313% 0.389417244
    10.13907062 4441 7.59548% 0.770110873
    14.38268188 5506 9.41696% 1.354410824
    18.92107481 6260 10.70653% 2.025790219
    23.52148965 6422 10.98360% 2.583505901
    28.07685659 5972 10.21396% 2.86775877
    32.55801082 5400 9.23566% 3.00694827
    36.93292359 4604 7.87426% 2.908193747
    41.23715698 3685 6.30249% 2.598965665
    45.62006198 3136 5.36353% 2.446843872
    50.38765223 2501 4.27748% 2.155321935
    55.34957161 1618 2.76728% 1.531676732
    60.37095192 989 1.69149% 1.021171415
    65.99939004 613 1.04842% 0.691950026
    71.73292736 305 0.52164% 0.374190474
    78.18427775 140 0.23944% 0.187206877
    85.22261376 38 0.06499% 0.05538763
    90.13115876 12 0.02052% 0.018498245
    96.1987941 4 0.00684% 0.006581183

    Which gives:

    • Expected Value: 27.10543

    (image: PDF chart with the expected value marked)
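
    Those two columns are just a weighted sum over the bins. A hedged sketch, assuming the bins are (centroid, count) pairs as in the sketch above:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class HistogramStats {
       // PDF of a bin = its count / total count;
       // Expected Value = sum over bins of (centroid × that bin's PDF).
       public static Double ExpectedValue(IReadOnlyList<(Double X, Int64 Count)> bins) {
          Double total = bins.Sum(b => (Double)b.Count);
          Double ev = 0;
          foreach (var b in bins)
             ev += b.X * (b.Count / total);
          return ev;
       }
    }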

    Cumulative Distribution Function (CDF)

    We can now also get the Cumulative Distribution Function (CDF):

    Bin Count PDF EV CDF
    2.113262834 3085 5.27630% 0.11150 5.27630%
    6.091181608 3738 6.39313% 0.38942 11.66943%
    10.13907062 4441 7.59548% 0.77011 19.26491%
    14.38268188 5506 9.41696% 1.35441 28.68187%
    18.92107481 6260 10.70653% 2.02579 39.38839%
    23.52148965 6422 10.98360% 2.58351 50.37199%
    28.07685659 5972 10.21396% 2.86776 60.58595%
    32.55801082 5400 9.23566% 3.00695 69.82161%
    36.93292359 4604 7.87426% 2.90819 77.69587%
    41.23715698 3685 6.30249% 2.59897 83.99836%
    45.62006198 3136 5.36353% 2.44684 89.36188%
    50.38765223 2501 4.27748% 2.15532 93.63936%
    55.34957161 1618 2.76728% 1.53168 96.40664%
    60.37095192 989 1.69149% 1.02117 98.09814%
    65.99939004 613 1.04842% 0.69195 99.14656%
    71.73292736 305 0.52164% 0.37419 99.66820%
    78.18427775 140 0.23944% 0.18721 99.90764%
    85.22261376 38 0.06499% 0.05539 99.97264%
    90.13115876 12 0.02052% 0.01850 99.99316%
    96.1987941 4 0.00684% 0.00658 100.00000%

    (image: CDF chart)

    And the CDF is where we can start to get the values I want.

    The median (50th percentile) is where the CDF reaches 50%:

    (image: CDF chart with the median marked)

    From interpolation of the data, we can find the x value where the CDF is 50%:

    Bin Count PDF EV CDF
    18.92107481 6260 10.70653% 2.02579 39.38839%
    23.52148965 6422 10.98360% 2.58351 50.37199%
    • t = (50 - 39.38839) / (50.37199 - 39.38839) = 10.61161 / 10.98360 = 0.96613
    • x_median = (1 - t) × 18.92107481 + t × 23.52148965 = 23.366

    So now we know:

    • Expected Value (mean): 27.10543
    • Median: 23.366

    My original ask was the IQR - the x values that span the middle 50% of the data, from 25% to 75%. Once again we can interpolate the CDF:

    Bin Count PDF EV CDF
    10.13907062 4441 7.59548% 0.77011 19.26491%
    12.7235 25.00000%
    14.38268188 5506 9.41696% 1.35441 28.68187%
    23.366 50.00000% (median)
    23.52148965 6422 10.98360% 2.58351 50.37199% (mode)
    27.10543 mean
    28.07685659 5972 10.21396% 2.86776 60.58595%
    32.55801082 5400 9.23566% 3.00695 69.82161%
    35.4351 75.00000%
    36.93292359 4604 7.87426% 2.90819 77.69587%

    (image: CDF chart with Q1, the median, and Q3 marked)
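
    And a sketch of that CDF interpolation as a reusable helper (same caveats: my own code, and interpolating between bin centroids only approximates the true percentiles):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class HistogramPercentiles {
       // Returns the x value at which the cumulative fraction reaches p (0..1),
       // by linear interpolation between neighbouring bin centroids.
       public static Double Percentile(IReadOnlyList<(Double X, Int64 Count)> bins, Double p) {
          Double total = bins.Sum(b => (Double)b.Count);
          Double target = p * total;

          Double cumulative = 0;
          Double prevX = bins[0].X;
          Double prevCumulative = 0;

          foreach (var b in bins) {
             cumulative += b.Count;
             if (cumulative >= target) {
                Double t = (target - prevCumulative) / (cumulative - prevCumulative);
                return (1 - t) * prevX + t * b.X;
             }
             prevX = b.X;
             prevCumulative = cumulative;
          }
          return bins[bins.Count - 1].X;
       }

       // Q1 and Q3 - the Interquartile Range endpoints. The same call gives the other
       // spans listed below, e.g. the middle 90% runs from Percentile(bins, 0.05) to Percentile(bins, 0.95).
       public static (Double Q1, Double Q3) Quartiles(IReadOnlyList<(Double X, Int64 Count)> bins) {
          return (Percentile(bins, 0.25), Percentile(bins, 0.75));
       }
    }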

    This can be continued to get other useful stats:

    • Mean (aka average, expected value)
    • Median (50%)
    • Middle quintile (middle 20%)
    • IQR (middle 50% range)
    • middle 3 quintiles (middle 60%)
    • 1 standard deviation range (middle 68.26%)
    • middle 80%
    • middle 90%
    • middle 95%
    • 2 standard deviations range (middle 95.45%)
    • middle 99%
    • 3 standard deviations range (middle 99.74%)
