Recently came across this question on how to find the xth percentile for a given stream of numbers. I have a basic understanding of how this could be achieved if the stream was relatively small (can be stored into memory, sorted and the xth value can be found) but I was wondering how the percentile could be approximated if the stream of numbers is fairly large and the number of numbers is unknown.
I think you could use Reservoir sampling to select uniformly k
elements from the stream S
and then approximate xth percentile of S
with the xth percentile of these k
numbers. k
depends on how much memory do you have and how precise the approximation should be.
EDIT
Here is a code example to test the solution:
// create random stream of numbers
Random random = new Random(0);
List<Integer> stream = new ArrayList<Integer>();
for (int i = 0; i < 100000; ++i) {
stream.add((int) (random.nextGaussian() * 100 + 30));
}
// get approximate percentile
int k = 1000; // sample size
int x = 50; // percentile
// init priority queue for sampling
TreeMap<Double, Integer> queue = new TreeMap<Double, Integer>();
// sample k elements from stream
for (int val : stream) {
queue.put(random.nextDouble(), val);
if (queue.size() > k) {
queue.pollFirstEntry();
}
}
// get xth percentile from k samples
List<Integer> sample = new ArrayList<Integer>(queue.values());
Collections.sort(sample);
int approxPercent = sample.get(sample.size() * x / 100);
System.out.println("Approximate percentile: " + approxPercent);
// get real value of the xth percentile
Collections.sort(stream);
int percent = stream.get(stream.size() * x / 100);
System.out.println("Real percentile: " + percent);
The result is:
Approximate percentile: 29
Real percentile: 29
I got a pretty good approximation for every x
i used and currently i don't see why it can't be suitable for your case.