I have a Python script that I'm trying to duplicate in Postgres. In Python I'm using scipy.stats.percentileofscore; in PostgreSQL I'm trying to use percent_rank. However, the numbers don't come out the same, and I'm trying to figure out exactly why.
I've checked that the rank and the dataset size are the same in both cases, and I've tried all four "kinds" from scipy. I can duplicate the scipy value manually, but I haven't figured out what postgres is actually doing.
All of the postgresql documentation I've seen is just "how do I use it?", not "what EXACTLY is it doing?"
As an example:
I have a dataset with 554 entries (sql count(*) agrees with python). The entry I'm actually looking at is number 306 (sql rank() agrees with me looking at it in excel).
python gets:
rank 55.234657039711195
weak 55.23465703971119
strict 55.054151624548744
mean 55.14440433212996
I can manually verify all of those numbers in excel.
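For concreteness, here is a sketch of that manual check in Python, assuming the entry of interest has 305 values strictly below it and no ties (which matches my data); the formulas are my understanding of what the four scipy kinds compute:

```python
# Reproducing the four scipy.stats.percentileofscore "kinds" by hand.
# Assumption: rank 306 of 554, no ties, so 305 values are strictly below.
n = 554
lt = 305             # values strictly below the score
le = 306             # values at or below the score (no ties)

weak = le / n * 100                      # ~55.234657
strict = lt / n * 100                    # ~55.054152
mean = (weak + strict) / 2               # ~55.144404
rank = (lt + le + (le > lt)) * 50 / n    # ~55.234657

print(rank, weak, strict, mean)
```

These agree with the four values above to the displayed precision.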
postgresql gets:
55.15370705
where is postgresql getting that number from?
I assume you mean percent_rank, since there is no percentile_rank. In that case the documentation explains it:
Computes the relative rank of the hypothetical row, that is (rank - 1) / (total rows - 1). The value thus ranges from 0 to 1 inclusive.
scipy is giving you 306/554 (for rank), while postgres is giving you 305/553. The difference is the coverage of the output interval: the postgres method will give you 0 for the lowest-ranked item in the set and 1 for the highest-ranked item, whereas scipy (with rank or weak) will give you 1/554 for the lowest-ranked item, 1 for the highest-ranked item, and 0 for values that are lower than any item in the set.
It looks like the postgres cume_dist function should agree exactly with scipy's weak mode:
Computes the cumulative distribution, that is (number of rows preceding or peers with hypothetical row) / (total rows). The value thus ranges from 1/N to 1.
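A pure-Python sketch of the two definitions, on a small made-up dataset with ties (not the question's 554-row set), shows they coincide:

```python
# Hypothetical example data; 'score' is the value whose rank we want.
data = [10, 20, 20, 30, 40]
score = 20

n = len(data)
le = sum(1 for x in data if x <= score)  # rows preceding or peers with the row

cume_dist = le / n        # postgres cume_dist definition: 3/5 = 0.6
weak = le / n * 100       # scipy kind='weak', as a percentage: 60.0

print(cume_dist, weak)
```

The only difference is scale: scipy reports a percentage, cume_dist a fraction in (0, 1].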