Search code examples
algorithmsamplingpercentileapproximation

How to approximate the xth percentile for a large unknown quantity of number


Recently came across this question on how to find the xth percentile for a given stream of numbers. I have a basic understanding of how this could be achieved if the stream was relatively small (can be stored into memory, sorted and the xth value can be found) but I was wondering how the percentile could be approximated if the stream of numbers is fairly large and the number of numbers is unknown.


Solution

  • I think you could use Reservoir sampling to select uniformly k elements from the stream S and then approximate xth percentile of S with the xth percentile of these k numbers. k depends on how much memory do you have and how precise the approximation should be.


    EDIT

    Here is a code example to test the solution:

    // create random stream of numbers
    Random random = new Random(0);
    List<Integer> stream = new ArrayList<Integer>();
    for (int i = 0; i < 100000; ++i) {
        stream.add((int) (random.nextGaussian() * 100 + 30));
    }
    // get approximate percentile
    int k = 1000; // sample size
    int x = 50; // percentile
    // init priority queue for sampling
    TreeMap<Double, Integer> queue = new TreeMap<Double, Integer>();
    // sample k elements from stream
    for (int val : stream) {
        queue.put(random.nextDouble(), val);
        if (queue.size() > k) {
            queue.pollFirstEntry();
        }
    }
    // get xth percentile from k samples
    List<Integer> sample = new ArrayList<Integer>(queue.values());
    Collections.sort(sample);
    int approxPercent = sample.get(sample.size() * x / 100);
    System.out.println("Approximate percentile: " + approxPercent);
    // get real value of the xth percentile
    Collections.sort(stream);
    int percent = stream.get(stream.size() * x / 100);
    System.out.println("Real percentile: " + percent);
    

    The result is:

    Approximate percentile: 29

    Real percentile: 29

    I got a pretty good approximation for every x i used and currently i don't see why it can't be suitable for your case.