Search code examples
pythonpandasdataframescipyscipy-optimize

New column in Pandas dataframe using least squares from scipy.optimize


I have a Pandas dataframe that looks like the following:

Race_ID   Date           Student_ID      feature1  
1         1/1/2023       3               0.02167131     
1         1/1/2023       4               0.17349148     
1         1/1/2023       6               0.08438952     
1         1/1/2023       8               0.04143787     
1         1/1/2023       9               0.02589056
1         1/1/2023       1               0.03866752     
1         1/1/2023       10              0.0461553     
1         1/1/2023       45              0.09212758
1         1/1/2023       23              0.10879326      
1         1/1/2023       102             0.186921      
1         1/1/2023       75              0.02990676      
1         1/1/2023       27              0.02731904      
1         1/1/2023       15              0.06020158      
1         1/1/2023       29              0.06302721                         
3         17/4/2022      5               0.2     
3         17/4/2022      2               0.1     
3         17/4/2022      3               0.55     
3         17/4/2022      4               0.15     

I would like to create a new column using the following method:

  1. Define the following function using integrals
import numpy as np
from scipy import integrate
from scipy.stats import norm
import scipy.integrate as integrate
from scipy.optimize import fsolve
from scipy.optimize import least_squares

def integrandforpi_i(xi, ti, *theta):
  prod = 1
  for t in theta:
    prod = prod * (1 - norm.cdf(xi - t))
  return prod * norm.pdf(xi - ti)

def pi_i(ti, *theta):
  return integrate.quad(integrandforpi_i, -np.inf, np.inf, args=(ti, *theta))[0]
  1. for each Race_ID, the value for each Student_ID in the new column is given by solving a system of nonlinear equations using least_squares in scipy.optimize as follows:

For example, for race 1, there are 14 students in the race and we will have to solve the following system of 14 nonlinear equations and we restrict the bound in between -2 and 2:

def equations(params):
    t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14 = params
    return (pi_i(t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14) - 0.02167131,
            pi_i(t2,t1,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14) - 0.17349148,
            pi_i(t3,t2,t1,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14) - 0.08438952,
            pi_i(t4,t2,t3,t1,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14) - 0.04143787,
            pi_i(t5,t2,t3,t4,t1,t6,t7,t8,t9,t10,t11,t12,t13,t14) - 0.02589056,
            pi_i(t6,t2,t3,t4,t5,t1,t7,t8,t9,t10,t11,t12,t13,t14) - 0.03866752,
            pi_i(t7,t2,t3,t4,t5,t6,t1,t8,t9,t10,t11,t12,t13,t14) - 0.0461553,
            pi_i(t8,t2,t3,t4,t5,t6,t7,t1,t9,t10,t11,t12,t13,t14) - 0.09212758,
            pi_i(t9,t2,t3,t4,t5,t6,t7,t8,t1,t10,t11,t12,t13,t14) - 0.10879326,
            pi_i(t10,t2,t3,t4,t5,t6,t7,t8,t9,t1,t11,t12,t13,t14) - 0.186921,
            pi_i(t11,t2,t3,t4,t5,t6,t7,t8,t9,t10,t1,t12,t13,t14) - 0.02990676,
            pi_i(t12,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t1,t13,t14) - 0.02731904,
            pi_i(t13,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t1,t14) - 0.06020158,
            pi_i(t14,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t13,t14,t1) - 0.06302721)

res = least_squares(equations, (1,1,1,1,1,1,1,1,1,1,1,1,1,1), bounds = ((-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2), (2,2,2,2,2,2,2,2,2,2,2,2,2,2)))

t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14 = res.x

Solving gives t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14 = [1.38473533 0.25616609 0.6935956 1.07314877 1.30201502 1.10781642 1.01839475 0.64349646 0.54630158 0.20719836 1.23347391 1.27642131 0.879412 0.83338882]

Similarly, for race 2, there are 4 students competing and we will have to solve the following system of 4 nonlinear equations:

def equations(params):
    t1,t2,t3,t4 = params
    return (pi_i(t1,t2,t3,t4) - 0.2,
            pi_i(t2,t1,t3,t4) - 0.1,
            pi_i(t3,t2,t1,t4) - 0.55,
            pi_i(t4,t2,t3,t1) - 0.15)

res = least_squares(equations, (1,1,1,1), bounds = ((-2,-2,-2,-2), (2,2,2,2)))

t1,t2,t3,t4 = res.x

which gives t1,t2,t3,t4 = [0.9209873 1.37615468 0.12293934 1.11735818].

Hence the desired outcome looks like

Race_ID   Date           Student_ID      feature1       new_column
1         1/1/2023       3               0.02167131     1.38473533
1         1/1/2023       4               0.17349148     0.25616609
1         1/1/2023       6               0.08438952     0.6935956
1         1/1/2023       8               0.04143787     1.07314877
1         1/1/2023       9               0.02589056     1.30201502
1         1/1/2023       1               0.03866752     1.10781642
1         1/1/2023       10              0.0461553      1.01839475
1         1/1/2023       45              0.09212758     0.64349646
1         1/1/2023       23              0.10879326     0.54630158
1         1/1/2023       102             0.186921       0.20719836
1         1/1/2023       75              0.02990676     1.23347391
1         1/1/2023       27              0.02731904     1.27642131 
1         1/1/2023       15              0.06020158     0.879412 
1         1/1/2023       29              0.06302721     0.83338882                    
3         17/4/2022      5               0.2            0.9209873
3         17/4/2022      2               0.1            1.37615468
3         17/4/2022      3               0.55           0.12293934
3         17/4/2022      4               0.15           1.11735818

I have no idea how to generate the new column. Also, my actual dataframe is much larger with many races so I would like to ask is there any way to speed up the computation too, thanks a lot.


Solution

  • Here is how to parametrize and automatize your regression for each group.

    First we load your dataset:

    import io
    import numpy as np
    import pandas as pd
    from scipy import integrate, stats, optimize
    
    data = pd.read_fwf(io.StringIO("""Race_ID   Date           Student_ID      feature1  
    1         1/1/2023       3               0.02167131     
    1         1/1/2023       4               0.17349148     
    1         1/1/2023       6               0.08438952     
    1         1/1/2023       8               0.04143787     
    1         1/1/2023       9               0.02589056
    1         1/1/2023       1               0.03866752     
    1         1/1/2023       10              0.0461553     
    1         1/1/2023       45              0.09212758
    1         1/1/2023       23              0.10879326      
    1         1/1/2023       102             0.186921      
    1         1/1/2023       75              0.02990676      
    1         1/1/2023       27              0.02731904      
    1         1/1/2023       15              0.06020158      
    1         1/1/2023       29              0.06302721                         
    3         17/4/2022      5               0.2     
    3         17/4/2022      2               0.1     
    3         17/4/2022      3               0.55     
    3         17/4/2022      4               0.15   """))
    

    We slightly rewrite your functions:

    def integrand(x, ti, *theta):
        product = 1.
        for t in theta:
            product = product * (1 - stats.norm.cdf(x - t))
        return product * stats.norm.pdf(x - ti)
    
    def integral(ti, *theta):
        return integrate.quad(integrand, -np.inf, np.inf, args=(ti, *theta))[0]
    

    Then we parametrize the system of equations:

    def change_order(parameters, i):
        return [parameters[i]] + parameters[0:i] + parameters[i+1:]
    
    def system(parameters, t):
        parameters = parameters.tolist()
        values = np.full(len(t), np.nan)
        for i in range(len(parameters)):
            parameters = change_order(parameters, i)
            values[i] = integral(*parameters) - t[i]
        return values
    

    At this stage we can solve any system of length n, now we create a function that allow us to use groupby in pandas:

    def solver(x):
        t = x["feature1"].values
        u = np.ones_like(t)
        solution = optimize.least_squares(system, u, bounds=[-2*u, 2*u], args=(t,))
        return solution.x
    

    And we apply it to the dataframe:

    data["new_column"] = data.groupby("Race_ID").apply(solver, include_groups=False).explode().values
    

    Which leads to:

        Race_ID       Date  Student_ID  feature1 new_column
    0         1   1/1/2023           3  0.021671   1.383615
    1         1   1/1/2023           4  0.173491    0.25823
    2         1   1/1/2023           6  0.084390   0.695116
    3         1   1/1/2023           8  0.041438   1.073675
    4         1   1/1/2023           9  0.025891   1.301445
    5         1   1/1/2023           1  0.038668   1.108209
    6         1   1/1/2023          10  0.046155   1.019114
    7         1   1/1/2023          45  0.092128   0.645103
    8         1   1/1/2023          23  0.108793   0.548053
    9         1   1/1/2023         102  0.186921   0.209302
    10        1   1/1/2023          75  0.029907    1.23329
    11        1   1/1/2023          27  0.027319   1.276228
    12        1   1/1/2023          15  0.060202   0.880537
    13        1   1/1/2023          29  0.063027   0.855987
    14        3  17/4/2022           5  0.200000   0.920987
    15        3  17/4/2022           2  0.100000   1.376155
    16        3  17/4/2022           3  0.550000   0.122939
    17        3  17/4/2022           4  0.150000   1.117358
    

    Indeed the whole operation is rather slow for high dimensional groups.

    There are few things you can do to speed up the whole process:

    • use numba for the solving step;
    • parallelize group solving;

    But it still is an heavy operation if your dataset is huge or have high dimensional groups.