Search code examples
pythonpostgresqlscipyrankpercentile

Getting PostgreSQL percent_rank and scipy.stats.percentileofscore results to match


I'm trying to QAQC the results of calculations that are done in a PostgreSQL database, using a python script to read in the inputs to the calculation and echo the calculation steps and compare the final results of the python script against the results from the PostgreSQL calculation.

The calculations in the PostgreSQL database use the percent_rank function, returning the percentile rank (from 0 to 1) of a single value in a list of values. In the python script I am using the Scipy percentileofscore function.

So, here's the question: I can't get the results to match, and I am wondering if anyone knows what settings I should use in the Scipy percentileofscore function to match the PostgreSQL percent_rank function.


Solution

  • You can use scipy.stats.rankdata. The following example reproduces the result shown at http://docs.aws.amazon.com/redshift/latest/dg/r_WF_PERCENT_RANK.html:

    In [12]: import numpy as np
    
    In [13]: from scipy.stats import rankdata
    
    In [14]: values = np.array([15, 20, 20, 20, 30, 30, 40])
    

    rankdata(values, method='min') gives the desired rank:

    In [15]: rank = rankdata(values, method='min')
    
    In [16]: rank
    Out[16]: array([1, 2, 2, 2, 5, 5, 7])
    

    Then a basic calculation gives the equivalent of percent_rank:

    In [17]: (rank - 1) / (len(values) - 1)
    Out[17]: 
    array([ 0.        ,  0.16666667,  0.16666667,  0.16666667,  0.66666667,
            0.66666667,  1.        ])
    

    (I'm using Python 3.5. In Python 2, use something like (rank - 1) / float(len(values) - 1).)


    You can use percentileofscore, but:

    • You have to use the argument kind='strict'.
    • You have to scale the result by n/(n-1), where n is the number of values.
    • You have to divide by 100 to convert from a true percentage to a fraction between 0 and 1.
    • percentileofscore expects its second argument to be a scalar, so you have to use a loop to compute the result separately for each value.

    Here's an example using the same values as above:

    In [87]: import numpy as np
    
    In [88]: from scipy.stats import percentileofscore
    
    In [89]: values = np.array([15, 20, 20, 20, 30, 30, 40])
    
    In [90]: n = len(values)
    

    Here I use a list comprehension to generate the result:

    In [91]: [n*percentileofscore(values, val, kind='strict')/100/(n-1) for val in values]
    Out[91]: 
    [0.0,
     0.16666666666666666,
     0.16666666666666666,
     0.16666666666666666,
     0.66666666666666663,
     0.66666666666666663,
     1.0]