I was trying to calculate ten percentiles (the 10th, 20th, and so on up to the 100th) for a list of chi-squared distributed values. I used a chi-squared distribution because I think it is closest to what our real data looks like.
I tried to do this step by step so that I don't miss anything.
import numpy as np

# 1000 chi-squared samples (6 degrees of freedom), truncated to int and scaled by 10
values = np.array([int(w) * 10 for w in np.random.chisquare(6, 1000)])
print('Min: ', np.min(values))
print('Max: ', np.max(values))
print('Mean: ', np.mean(values))
for p in range(10, 101, 10):
    percentile = np.percentile(values, p)
    print('Percent:', p, 'Percentile:', percentile)
This is an example output of the code above:
Min: 0
Max: 230
Mean: 55.49
Percent: 10 Percentile: 20.0
Percent: 20 Percentile: 30.0
Percent: 30 Percentile: 30.0
Percent: 40 Percentile: 40.0
Percent: 50 Percentile: 50.0
Percent: 60 Percentile: 60.0
Percent: 70 Percentile: 70.0
Percent: 80 Percentile: 80.0
Percent: 90 Percentile: 100.0
Percent: 100 Percentile: 230.0
The point I'm struggling with is:
why do I get the same "Percentile" for 20 and 30 percent?
I always thought that 20 / 30 means: 20 percent of the values lie below the following value (in this case 30). In the same way, 100% of the values lie at or below 230, which is the maximum.
Which idea am I missing?
Because values was created with the expression int(w)*10, all the values are integer multiples of 10. This means most of the values are repeated many times. For example, I just ran that code and found that the value 30 was repeated 119 times. It turns out that, when you count the values, the interquantile interval 20% - 30% contains only the value 30. That's why the value 30 is repeated in your output.
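As a minimal sketch of how such repeat counts can be obtained, the snippet below regenerates a sample with the same transformation and tallies each distinct value. The fixed seed and the use of np.random.default_rng are assumptions made only for reproducibility; the counts it prints will differ from the ones quoted here.

```python
import numpy as np

# Assumption: a fixed seed, used only so that the run is reproducible.
rng = np.random.default_rng(0)
samples = rng.chisquare(6, 1000)

# Same transformation as in the question: truncate to int, then scale by 10.
values = np.array([int(w) * 10 for w in samples])

# Distinct values and how often each one occurs.
distinct, counts = np.unique(values, return_counts=True)
for v, c in zip(distinct, counts):
    print(v, c)

print('distinct values:', len(distinct), 'out of', len(values), 'samples')
```

With 1000 samples spread over only a few dozen distinct values, long runs of identical entries in the sorted data are unavoidable.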
I can break down my data set as
value  count
    0     14
   10     72
   20    100
   30    119
   40    152
etc.
Break this up into groups of 100 (since you have 1000 values, and you are looking at 10%, 20%, etc).
                                               np.percentile
Percent  Group      Values (counts)            (largest value in previous column)
-------  ---------  -------------------------  ----------------------------------
   10      0 -  99  0 (14), 10 (72), 20 (14)   20
   20    100 - 199  20 (86), 30 (14)           30
   30    200 - 299  30 (100)                   30
   40    300 - 399  30 (5), 40 (95)            40
etc.
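The grouping can be sketched in code. The snippet below rebuilds the start of the sorted data set from the five value counts listed earlier (0, 10, 20, 30, 40) and summarizes the first four blocks of 100. The helper name group_breakdown is hypothetical, and "largest value in the group" is only a shorthand: by default np.percentile interpolates between neighboring sorted entries, which gives the same number whenever those neighbors are equal.

```python
import numpy as np

# Counts copied from the value/count breakdown above (only five values are listed there).
listed_values = np.array([0, 10, 20, 30, 40])
listed_counts = np.array([14, 72, 100, 119, 152])

# Rebuild the start of the sorted data set: 14 zeros, 72 tens, and so on.
sorted_head = np.repeat(listed_values, listed_counts)


def group_breakdown(sorted_data, group_size=100, n_groups=4):
    """Hypothetical helper: summarize consecutive blocks of a sorted array.

    Returns one (composition, largest) pair per block, where composition
    maps each value in the block to its count.
    """
    rows = []
    for g in range(n_groups):
        block = sorted_data[g * group_size:(g + 1) * group_size]
        vals, cnts = np.unique(block, return_counts=True)
        rows.append((dict(zip(vals.tolist(), cnts.tolist())), int(block[-1])))
    return rows


rows = group_breakdown(sorted_head)
for i, (composition, largest) in enumerate(rows, start=1):
    print(i * 10, composition, largest)
```

Only the first four groups can be reconstructed this way, because the listed counts cover just the first 457 sorted entries.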
Given the distribution that you used, this output seems to be the most likely, but if you rerun the code enough times, you'll encounter different output. I just ran it again and got
10 20.0
20 20.0
30 30.0
40 40.0
50 50.0
60 50.0
70 60.0
80 80.0
90 100.0
100 210.0
Note that both 20.0 and 50.0 are repeated. The counts of the values for this run are:
In [56]: values, counts = np.unique(values, return_counts=True)
In [57]: values
Out[57]:
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120,
130, 140, 150, 160, 170, 180, 190, 210])
In [58]: counts
Out[58]:
array([ 14, 73, 129, 134, 134, 119, 105, 67, 73, 33, 41, 21, 19,
16, 8, 7, 1, 2, 2, 1, 1])
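As a check, the sorted sample for this second run can be rebuilt from the values and counts shown in the session, and the percentiles recomputed. This is a sketch: the variable names unique_values, data, and cumulative are my own, and np.percentile is called with its default settings.

```python
import numpy as np

# Values and counts copied from the session output above.
unique_values = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
                          110, 120, 130, 140, 150, 160, 170, 180, 190, 210])
counts = np.array([14, 73, 129, 134, 134, 119, 105, 67, 73, 33, 41,
                   21, 19, 16, 8, 7, 1, 2, 2, 1, 1])

# Expand back to the full sorted sample of 1000 entries.
data = np.repeat(unique_values, counts)

percents = list(range(10, 101, 10))
percentiles = [float(np.percentile(data, p)) for p in percents]
for p, q in zip(percents, percentiles):
    print(p, q)

# Cumulative counts: the value 20 occupies sorted positions 87..215,
# so both the 10% and the 20% marks land inside that run.
cumulative = np.cumsum(counts)
print(cumulative[:4])
```

The cumulative counts make the repeats visible: whenever two consecutive percent marks fall between the same pair of cumulative totals, they report the same value.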