
How can I create a function from this data?


I have a dataset in the form of a table:

Score   Percentile
 381         1
 382         2
 383         2
      ...
 569        98
 570        99

The complete table is here as a Google spreadsheet.

Currently, I am computing a score and then doing a lookup on this dataset (table) to find the corresponding percentile rank.

Is it possible to create a function to calculate the corresponding percentile rank for a given score using a formula instead of looking it up in the table?


Solution

  • It's impossible to recover the function that generated a given table of data if no information is provided about the process behind that data.

    That being said, we can speculate.

    Since it's a "percentile" function, it probably represents the cumulative distribution function (CDF) of some probability distribution. A very common one is the normal distribution, whose CDF (i.e. the integral of its density) can be written in terms of the so-called "error function" ("erf").
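    As a quick sanity check of that relationship, the sketch below compares the erf-based expression with the normal CDF from Python's standard library. The helper name `normal_cdf_via_erf` and the parameters `mu = 100`, `sigma = 15` are illustrative assumptions, not values taken from your table.

```python
import math
from statistics import NormalDist

def normal_cdf_via_erf(x, mu, sigma):
    """Normal CDF written with erf: 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Illustrative parameters only (hypothetical, not fitted to any data).
mu, sigma = 100.0, 15.0
dist = NormalDist(mu, sigma)

# Print both versions side by side for a few sample points.
for x in (70.0, 100.0, 115.0, 140.0):
    print(x, round(normal_cdf_via_erf(x, mu, sigma), 6), round(dist.cdf(x), 6))
```

    Multiplying the result by 100 turns the CDF value into a percentile, which is where the `50 + 50 * erf(...)` form comes from.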

    In fact, your tabulated data looks a lot like an error function for a variable whose average value is 473.09 and whose standard deviation is about 37.05:

    [Figure: your dataset (orange) and the fitted error function, erf (blue)]

    However, the agreement is not perfect, which could be for any of three reasons:

    1. The fitting procedure I used to estimate the parameters of the error function didn't apply the right constraints (because I have no idea what I'm modelling!).
    2. Your dataset doesn't represent an exact normal distribution, but rather real-world data whose underlying distribution is normal. In that case, the features of your sample data that deviate from the model are ignored altogether by the fit.
    3. The underlying distribution is not a normal distribution at all; its integral just happens to look like the error function by chance.

    There is literally no way for me to tell!

    If you want to use this function, this is its definition:

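    import numpy as np
    from scipy.special import erf

    def fitted_erf(x):
        """Approximate percentile rank for a score x (fitted normal CDF)."""
        c = 473.09090474  # fitted centre (mean) of the distribution
        w = 37.04826334   # fitted width (standard deviation)
        # normal CDF written via erf, rescaled from [0, 1] to [0, 100]
        return 50 + 50 * erf((x - c) / (w * np.sqrt(2)))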
    

    Tests:

    In [2]: fitted_erf(439) # 17 from the table
    Out[2]: 17.874052406601457
    
    In [3]: fitted_erf(457) # 34 from the table
    Out[3]: 33.20270318344252
    
    In [4]: fitted_erf(474) # 51 from the table
    Out[4]: 50.97883169390196
    
    In [5]: fitted_erf(502) # 79 from the table
    Out[5]: 78.23955071273468
    

    However, I'd strongly advise you to check whether a fitted function, made without knowledge of your data source, is the right tool for your task.
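    One possible way to make that check concrete is to measure how far the fitted function lands from the tabulated percentiles. The sketch below is a standard-library version using only the four (score, percentile) pairs quoted in the tests above; the names `fitted_percentile`, `samples` and `max_error` are my own, and running it over the complete table would be more informative.

```python
import math

# Parameters copied from the fitted function above.
C = 473.09090474
W = 37.04826334

def fitted_percentile(x):
    """Stdlib-only equivalent of the fitted erf function, for a single score."""
    return 50 + 50 * math.erf((x - C) / (W * math.sqrt(2)))

# (score, percentile) pairs quoted in the tests above.
samples = [(439, 17), (457, 34), (474, 51), (502, 79)]

# Absolute difference between the fitted value and the tabulated percentile.
errors = [abs(fitted_percentile(score) - pct) for score, pct in samples]
max_error = max(errors)
print(f"max abs error over samples: {max_error:.2f} percentile points")
```

    Whether an error of that size is acceptable depends on what the percentiles are used for; if exact agreement with the table matters, the lookup remains the safer choice.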


    P.S.

    In case you're interested, this is the code used to obtain the parameters:

    import numpy as np
    from scipy.special import erf
    from scipy.optimize import curve_fit
    
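    # 'table.csv' is the spreadsheet exported as CSV from Google Spreadsheets
    tab = np.genfromtxt('table.csv', delimiter=',', skip_header=1)
    x = tab[:, 0]  # scores
    y = tab[:, 1]  # percentiles
    
    def parametric_erf(x, c, w):
        # normal CDF with centre c and width w, rescaled to 0..100
        return 50 + 50 * erf((x - c) / (w * np.sqrt(2)))
    
    # p0 is the initial guess for [c, w]; pcov is the covariance matrix of the fit
    pars, pcov = curve_fit(parametric_erf, x, y, p0=[475, 10])
    
    print(pars)
    # outputs [  473.09090474,   37.04826334]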
    

    And this is the code used to generate the plot:

    import matplotlib.pyplot as plt
    
    plt.plot(x,parametric_erf(x,*pars))
    plt.plot(x,y)
    plt.show()