Search code examples
javaconcurrenthashmappercentile

How can I calculate percentile in java without using any other library


I am trying to calculate 95th Percentile from the data sets which I have populated in my below ConcurrentHashMap.

I am interested in finding out how many calls came back in 95th percentile of time

My Map will look like this and it will always be sorted in ascending order on the keys- In which

key - means number of milliseconds
value - means number of calls that took that much milliseconds

Below is my Map data-

Milliseconds    Number

0               1702
1               15036
2               14262
3               13190
4               9137
5               5635
6               3742
7               2628
8               1899
9               1298
10              963
11              727
12              503
13              415
14              311
15              235
16              204
17              140
18              109
19              83
20              72

For example, from the above data sets, it means

1702 calls came back in 0 milliseconds

15036 calls came back in 1 milliseconds

Now I can calculate the 95th percentile by plugging the above data sets in the Excel sheet. But I was thinking to calculate the percentile in Java code.

I know the algorithm will look something like this-

Sum all values from the map, calculate 95% of the sum, iterate the map keys in ascending order keeping a running total of values, and when sum equals or exceeds the previously calculated 95% of the total sum, the key should be the 95th percentile I guess.

Below is the map which will have above data sets.

Map<Long, Long> histogram = new ConcurrentHashMap<Long, Long>

I am not sure whether I am algorithm is also correct or not. I am just trying to find out how many calls came back in 95th percentile of time.

Below is the code I have got so far basis on my above algorithm.

private static void logPercentileInfo() {

    double total = 0;
    for (Map.Entry<Long, Long> entry : CassandraTimer.histogram.entrySet()) {
        long value = entry.getKey() * entry.getValue();
        total += value;
    }

    double sum = 0.95*total;

    double totalSum = 0;
    for (Map.Entry<Long, Long> entry : CassandraTimer.histogram.entrySet()) {
        totalSum += entry.getValue();

        if(totalSum >= sum) {
        System.out.println(entry.getKey());//this is the 95th percentile I guess
        }
    }
}

Let me know if I got everything correct in calculating the 95th percentile from my above data sets. If there is any improvement as well, please let me know.

Updated Code:-

Below is my updated code which solves the problem for ascending order of keys

/**
 * A simple method to log 95th percentile information
 */
private static void logPercentileInfo() {

    double total = 0;
    for (Map.Entry<Long, Long> entry : CassandraTimer.histogram.entrySet()) {
        long value = entry.getKey() * entry.getValue();
        total += value;
    }

    double sum = 0.95*total;

    double totalSum = 0;

    SortedSet<Long> keys = new TreeSet<Long>(CassandraTimer.histogram.keySet());
    for (long key : keys) {

        totalSum += CassandraTimer.histogram.get(key);

        if(totalSum >= sum) {
           //this is the 95th percentile I guess
            System.out.println(key);
        }
    }

}

Can anyone take a look and let me know whether I am calculating the percentile correctly or not?


Solution

  • From my comment on your question:

    Since you are using a hash map your keys aren't going to be stored in sorted order. I.e., if you print out entry.getKey() in your loop you are going to see that the keys are not in order. So that is your main problem. A TeeMap or ConcurrentSkipListMap will keep its keys in order

    changing Map<Long, Long> histogram = new ConcurrentHashMap<Long, Long>

    to

    Map<Long, Long> histogram = new ConcurrentSkipListMap<Long, Long>()

    will give you a map that'll return your keys in sorted order.

    Another issue in your code is when you calculate the total sum you do:

    total += entry.getKey() * entry.getValue(); // total += key*value

    and when you calculate the sum the second time around you are doing:

    totalSum += CassandraTimer.histogram.get(key); // totalSum += value

    I think you want to count up the total number of observations and then multiply that by 0.95. That'll give you the number of observations below the 95th percentile.

    L = .95 * total_observations

    Then iterate over your map, summing up the number of observations. Once the total number of observations exceeds L then the corresponding key is the value at the 95th percentile.

    private static void logPercentileInfo() {
        double total = 0;
        for (Map.Entry<Long, Long> entry : CassandraTimer.histogram.entrySet()) {
            long value = entry.getValue();
            total += value;
        }
    
        double sum = 0.95*total;
        double totalSum = 0;
    
        SortedSet<Long> keys = new TreeSet<Long>(CassandraTimer.histogram.keySet());
        for (long key : keys) {
    
            totalSum += CassandraTimer.histogram.get(key);
    
            if(totalSum >= sum) {
               System.out.println(key);
               break;
            }
        }
    }