I have a dataset in the form of a table:
Score Percentile
381 1
382 2
383 2
...
569 98
570 99
The complete table is here as a Google spreadsheet.
Currently, I am computing a score and then doing a lookup on this dataset (table) to find the corresponding percentile rank.
Is it possible to create a function to calculate the corresponding percentile rank for a given score using a formula instead of looking it up in the table?
It's impossible to recreate the function that generated a given table of data, if no information is provided about the process behind that data.
That being said, we can make some speculation.
Since it's a "percentile" function, it probably represents the cumulative value of a probability distribution of some sort. A very common probability distribution is the normal distribution, whose "cumulative" counterpart (i.e. its integral) is the so called "error function" ("erf").
In fact, your tabulated data looks a lot like an error function for a variable whose average value is 473.09:
your dataset: orange; fitted error function (erf): blue
However, the agreement is not perfect and that could be because of three reasons:
There is literally no way for me to tell!
If you want to use this function, this is its definition:
import numpy as np
from scipy.special import erf
def fitted_erf(x):
c = 473.09090474
w = 37.04826334
return 50+50*erf((x-c)/(w*np.sqrt(2)))
Tests:
In [2]: fitted_erf(439) # 17 from the table
Out[2]: 17.874052406601457
In [3]: fitted_erf(457) # 34 from the table
Out[3]: 33.20270318344252
In [4]: fitted_erf(474) # 51 from the table
Out[4]: 50.97883169390196
In [5]: fitted_erf(502) # 79 from the table
Out[5]: 78.23955071273468
however I'd strongly advise you to check if a fitted function, made without knowledge of your data source, is the right tool for your task.
In case you're interested, this is the code used to obtain the parameters:
import numpy as np
from scipy.special import erf
from scipy.optimize import curve_fit
tab=np.genfromtxt('table.csv', delimiter=',', skip_header=1)
# using a 'table.csv' file generated by Google Spreadsheets
x = tab[:,0]
y = tab[:,1]
def parametric_erf(x, c, w):
return 50+50*erf((x-c)/(w*np.sqrt(2)))
pars, j = curve_fit(parametric_erf, x, y, p0=[475,10])
print(pars)
# outputs [ 473.09090474, 37.04826334]
and to generate the plot
import matplotlib.pyplot as plt
plt.plot(x,parametric_erf(x,*pars))
plt.plot(x,y)
plt.show()