Search code examples
kdb

Efficient method to query percentile in a list


I've come across the requirement to collect the percentiles from a list a few times:

  1. Within what percentile is a certain number?
  2. What is the nth percentile in a list?

I have written these methods to solve the issue:

/for 1:    
percentileWithinThreshold:{[threshold;list] (100 * count where list <= threshold) % count list};

/for 2:
thresholdForPercentile:{[percentile;list] (asc list)[-1 + "j"$((percentile % 100) * count list)]};

They work well for both use cases, but I was thinking this is a too common use case, so probably Q offers already something out of the box that does the same. Any idea if there already exists something else?


Solution

  • '100 xrank' generates percentiles.

      q) 100 xrank 1 2 3 4
      q) 0 25 50 75
    

    Solution for your second requirement:

      q) f:{ y (100 xrank y:asc y) bin x}
    

    Also, note that your second function result will not be always same as xrank. Reason for that is 'xrank' uses floor for fractional index output which is the normal scenario with calculating percentiles and your function round up the value and subtracts -1 which ensures that output will always be lesser-equal to input percentile. For example:

      q) thresholdForPercentile[63;til 21] / output 12 
      q) f[63;til 21] / output 13
    

    For first requirement, there is no inbuilt function. However you could improve your function if you keep your input list sorted because in that case you could use 'bin' function which runs faster on big lists.

      q) percentileWithinThreshold:{[threshold;list] (100 * 1+list bin threshold) % count list};
    

    Remember that 'bin' will throw type error if one argument is of float type and other is an integer. So make sure to cast them correctly inside the function.