python python-3.x mathematical-expressions

How can I create a function from this data?

I have a dataset in the form of a table:

Score   Percentile
 381         1
 382         2
 383         2
      ...
 569        98
 570        99

The complete table is here as a Google spreadsheet.

Currently, I am computing a score and then doing a lookup on this dataset (table) to find the corresponding percentile rank.

Is it possible to create a function to calculate the corresponding percentile rank for a given score using a formula instead of looking it up in the table?

Solution

It's impossible to recover the function that generated a given table of data if no information is provided about the process behind that data.

That said, we can speculate.

Since it's a "percentile" function, it probably represents the cumulative distribution function (CDF) of a probability distribution of some sort. A very common probability distribution is the normal distribution, whose CDF (i.e. the integral of its density) can be written in terms of the so-called "error function" ("erf").
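To make the link between the normal distribution and erf concrete: the normal CDF is 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2)))). The sketch below checks this numerically against scipy.stats.norm.cdf. The values of mu and sigma are arbitrary illustrative numbers, not taken from your data.

```python
import numpy as np
from scipy.special import erf
from scipy.stats import norm

# Arbitrary illustrative parameters (assumption, not fitted to anything)
mu, sigma = 500.0, 40.0
x = np.linspace(350.0, 650.0, 7)

# Normal CDF written explicitly in terms of the error function
cdf_via_erf = 0.5 * (1.0 + erf((x - mu) / (sigma * np.sqrt(2))))

# The same quantity computed by SciPy's normal distribution
cdf_via_scipy = norm.cdf(x, loc=mu, scale=sigma)

print(np.allclose(cdf_via_erf, cdf_via_scipy))
```

Multiplying this CDF by 100 gives a percentile on a 0-100 scale.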

In fact, your tabulated data looks a lot like an error function for a variable whose average value is 473.09:

[Figure: your dataset (orange) and the fitted error function, erf (blue)]

However, the agreement is not perfect, and that could be for any of three reasons:

1. The fitting procedure I've used to generate the parameters for the error function didn't use the right constraints (because I have no idea what I'm modelling!).
2. Your dataset doesn't represent an exact normal distribution, but rather real-world data whose underlying distribution is the normal distribution. In that case the fit simply ignores the features of your sample that deviate from the model.
3. The underlying distribution is not a normal distribution at all; its integral just happens to look like the error function.

There is literally no way for me to tell!

If you want to use this function, this is its definition:

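import numpy as np
from scipy.special import erf

def fitted_erf(x):
    """Approximate percentile rank (0-100) for a score x."""
    c = 473.09090474  # fitted centre (mean) of the distribution
    w = 37.04826334   # fitted width (standard deviation)
    return 50 + 50 * erf((x - c) / (w * np.sqrt(2)))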

Tests:

In [2]: fitted_erf(439) # 17 from the table
Out[2]: 17.874052406601457

In [3]: fitted_erf(457) # 34 from the table
Out[3]: 33.20270318344252

In [4]: fitted_erf(474) # 51 from the table
Out[4]: 50.97883169390196

In [5]: fitted_erf(502) # 79 from the table
Out[5]: 78.23955071273468

However, I'd strongly advise you to check whether a fitted function, made without knowledge of your data source, is the right tool for your task.
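One rough way to run such a check is to measure how far the fitted values are from the table. The sketch below uses only the four (score, percentile) pairs quoted in the tests above; with the full table you would pass all rows instead. The names scores, table_percentiles, residuals and max_abs_error are illustrative.

```python
import numpy as np
from scipy.special import erf

def fitted_erf(x):
    c = 473.09090474
    w = 37.04826334
    return 50 + 50 * erf((x - c) / (w * np.sqrt(2)))

# (score, percentile) pairs quoted in the tests above
scores = np.array([439, 457, 474, 502])
table_percentiles = np.array([17, 34, 51, 79])

# Difference between the fitted function and the table at each score
residuals = fitted_erf(scores) - table_percentiles
max_abs_error = np.max(np.abs(residuals))

print(residuals)
print(max_abs_error)
```

At these four points the fit is off by less than one percentile point, which may or may not be acceptable since the table itself only contains whole percentiles.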

P.S.

In case you're interested, this is the code used to obtain the parameters:

import numpy as np
from scipy.special import erf
from scipy.optimize import curve_fit

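# using a 'table.csv' file generated by Google Spreadsheets
tab = np.genfromtxt('table.csv', delimiter=',', skip_header=1)
x = tab[:, 0]  # scores
y = tab[:, 1]  # percentiles

def parametric_erf(x, c, w):
    # scaled normal CDF: c is the centre, w the width (standard deviation)
    return 50 + 50 * erf((x - c) / (w * np.sqrt(2)))

# p0 is the initial guess for (c, w); pcov (the covariance matrix) is not used here
pars, pcov = curve_fit(parametric_erf, x, y, p0=[475, 10])

print(pars)
# outputs [  473.09090474,   37.04826334]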

And this is the code used to generate the plot:

import matplotlib.pyplot as plt

plt.plot(x, parametric_erf(x, *pars), label='fitted error function (erf)')
plt.plot(x, y, label='your dataset')
plt.legend()
plt.show()
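The second value returned by curve_fit, which the fitting code above does not use, is the estimated covariance matrix of the parameters. The square roots of its diagonal give one-standard-deviation uncertainties for c and w. The sketch below shows this on synthetic data, since table.csv is not available here: the percentiles are generated from assumed parameters (true_c, true_w) and rounded to whole numbers, so the printed values are not those of your dataset.

```python
import numpy as np
from scipy.special import erf
from scipy.optimize import curve_fit

def parametric_erf(x, c, w):
    return 50 + 50 * erf((x - c) / (w * np.sqrt(2)))

# Synthetic stand-in for table.csv (assumption): integer scores with
# percentiles generated from known parameters and rounded to whole numbers
true_c, true_w = 473.0, 37.0
x = np.arange(381, 571, dtype=float)
y = np.round(parametric_erf(x, true_c, true_w))

# Same fit as above, but keeping the covariance matrix
pars, pcov = curve_fit(parametric_erf, x, y, p0=[475, 10])

# One-standard-deviation uncertainty on each fitted parameter
perr = np.sqrt(np.diag(pcov))

print(pars)
print(perr)
```

Small uncertainties only say that the parameters are well determined under the assumed model; they do not show that the error function is the right model for your data.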