Building off and earlier question: Computing stats on generators in single pass. Python
As I mentioned before computing statistics from a generator in a single pass is extremely fast and memory efficient. Complex statistics and rank attributes like the 90th percentile and the nth smallest often need more complex work than standard deviation and averages (solved in the above). These approaches become very important when working with map/reduce jobs and large datasets where putting the data into a list or computing multiple passes becomes very slow.
The following is an O(n) quicksort style algorithm for looking up data based on rank order. Useful for finding medians, percentiles, quartiles, and deciles. Equivalent to data[n] when the data is already sorted. But needs all the data in a list that can be split/pivoted.
How can you compute medians, percentiles, quartiles, and deciles with a generator on a single pass?
The Quicksort style algorithm that needs a complete list
import random
def select(data, n):
"Find the nth rank ordered element (the least value has rank 0)."
data = list(data)
if not 0 <= n < len(data):
raise ValueError('not enough elements for the given rank')
while True:
pivot = random.choice(data)
pcount = 0
under, over = [], []
uappend, oappend = under.append, over.append
for elem in data:
if elem < pivot:
uappend(elem)
elif elem > pivot:
oappend(elem)
else:
pcount += 1
if n < len(under):
data = under
elif n < len(under) + pcount:
return pivot
else:
data = over
n -= len(under) + pcount
You will need to store large parts of the data. Up to the point where it may just pay off to store it completely. Unless you are willing to accept an approximate algorithm (which may be very reasonable when you know your data is independent).
Consider you need to find the median of the following data set:
0 1 2 3 4 5 6 7 8 9 -1 -2 -3 -4 -5 -6 -7 -8 -9
The median is obviously 0
. However, if you have seen only the first 10 elements, it is your worst guess at that time! So in order to find the median of an n element stream, you need to keep at least n/2
candidate elements in memory. And if you do not know the total size n
, you need to keep all!
Here are the medians for every odd-sized situation:
0 _ 1 _ 2 _ 3 _ 4 _ 4 _ 3 _ 2 _ 1 _ 0
While they were never candidates, you also need to remember the element 5 - 9:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
yields the median 9
. For every element in a series of size n I can find a continued series of size O(2*n) that has this element as median. But obviously, these series are not random / independent.
See "On-line" (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis? for an overview of related methods.