Search code examples
pythonpandasdataframealgorithmgroup-by

Permutation summation in Pandas dataframe growing super exponentially


I have a pandas dataframe that looks like

import pandas as pd

data = {
  "Race_ID": [2,2,2,2,2,5,5,5,5,5,5],
  "Student_ID": [1,2,3,4,5,9,10,2,3,6,5],
  "theta": [8,9,2,12,4,5,30,3,2,1,50]
}

df = pd.DataFrame(data)

And I would like to create a new column df['feature'] by the following method: with each Race_ID, suppose the Student_ID is equal to i, then we define feature to be

enter image description here

def f(thetak, thetaj, thetai, *theta):
  prod = 1;
  for t in theta:
    prod = prod * t;
  return ((thetai + thetaj) / (thetai + thetaj + thetai * thetak)) * prod 

where k,j,l are the Student_IDs within the same Race_ID such that k =/= i, j=/=i,k, l=/=k,j,i and theta_i is theta with Student_ID equals to i. So for example for Race_ID =2, Student_ID =1, we have feature equals to

f(2,3,1,4,5)+f(2,3,1,5,4)+f(2,4,1,3,5)+f(2,4,1,5,3)+f(2,5,1,3,4)+f(2,5,1,4,3)+f(3,2,1,4,5)+f(3,2,1,5,4)+f(3,4,1,2,5)+f(3,4,1,5,2)+f(3,5,1,2,4)+f(3,5,1,4,2)+f(4,2,1,3,5)+f(4,2,1,5,3)+f(4,3,1,2,5)+f(4,3,1,5,2)+f(4,5,1,2,3)+f(4,5,1,3,2)+f(5,2,1,3,4)+f(5,2,1,4,3)+f(5,3,1,2,4)+f(5,3,1,4,2)+f(5,4,1,2,3)+f(5,4,1,3,2)

which is equal to 299.1960138012742.

But as one quickly realises, the number of terms in the sum grows super exponentially with the number of students in a race: if there are n students in a race, then there are (n-1)! terms in the sum.

Fortunately, due to the symmetry property of f, we can reduce the number of terms to a mere (n-1)(n-2) terms by noting the following:

Let i,j,k be given and 1,2,3 (for example sake) be different from i,j,k (i.e. 1,2,3 is in *arg). Then f(k,j,i,1,2,3) = f(k,j,i,1,3,2) = f(k,j,i,2,1,3) = f(k,j,i,2,3,1) = f(k,j,i,3,1,2) = f(k,j,i,3,2,1). Hence we can reduce the number of terms if we just compute any one of the terms and then multiply it by (n-3)!

So for example, for Race_ID =5, Student_ID =9, there would have been 5!=120 terms to sum, but using the above symmetry property, we only have to sum 5x4 = 20 terms (5 choices for k, 4 choices for i and 1 (non-unique choice) for l's), namely

f(2,3,9,5,6,10)+f(2,5,9,3,6,10)+f(2,6,9,3,5,10)+f(2,10,9,3,5,6)+f(3,2,9,5,6,10)+f(3,5,9,3,6,10)+f(3,6,9,2,5,10)+f(3,10,9,2,5,6)+f(5,2,9,3,6,10)+f(5,3,9,2,6,10)+f(5,6,9,2,3,10)+f(5,10,9,2,3,6)+f(6,2,9,3,5,10)+f(6,3,9,2,5,10)+f(6,5,9,2,3,10)+f(6,10,9,2,3,5)+f(10,2,9,3,5,6)+f(10,3,9,2,5,6)+f(10,5,9,2,3,6)+f(10,6,9,2,3,5)

and the feature for student 9 in race 5 will be equal to the above sum times 3! = 53588.197759

So by question is: how do i write the sum for the above dataframe? I have computed the features by hand for checking and the desired outcome looks like:

import pandas as pd

data = {
  "Race_ID": [2,2,2,2,2,5,5,5,5,5,5],
  "Student_ID": [1,2,3,4,5,9,10,2,3,6,5],
  "theta": [8,9,2,12,4,5,30,3,2,1,50],
  "feature": [299.1960138012742, 268.93506341257876, 634.7909309816431, 204.18901708653254, 483.7234700875771, 53588.197759, 9395.539167178009, 78005.26224935807, 92907.8753942894, 118315.38359654899, 5600.243276203378]
}

df = pd.DataFrame(data)

Thank you so much.


Solution

  • This code can be tweaked and made faster:

    import pandas as pd
    import numpy as np
    from itertools import permutations
    
    data = {
      "Race_ID": [2,2,2,2,2,5,5,5,5,5,5],
      "Student_ID": [1,2,3,4,5,9,10,2,3,6,5],
      "theta": [8,9,2,12,4,5,30,3,2,1,50]
    }
    
    df = pd.DataFrame(data)
    

    Define functions:

    def f(thetak, thetaj, thetai, *theta):
      prod = 1;
      for t in theta:
        prod *= t;
      return ((thetai + thetaj) / (thetai + thetaj + thetai * thetak)) * prod 
    
    
    def student_race_theta(si, thetas, student):
        total = 0
        for ts in permutations(range(1, len(si)+1)):
            if si.iloc[ts[2]-1] == student:
                thetak, thetaj, thetai, *theta = [thetas.iloc[i-1] for i in ts]
                total += f(thetak, thetaj, thetai, *theta)
        return total
    
    def fxs(x):
        s = pd.Series()
        for i, r in x.iterrows():
            s[i] = student_race_theta(x['Student_ID'], x['theta'], r['Student_ID'])
        return s
        
    df['feature'] = df.groupby('Race_ID', group_keys=False)\
                      [['Race_ID', 'Student_ID', 'theta']].apply(fxs)
    
    print(df)
    

    Output:

        Race_ID  Student_ID  theta        feature
    0         2           1      8     299.196014
    1         2           2      9     268.935063
    2         2           3      2     634.790931
    3         2           4     12     204.189017
    4         2           5      4     483.723470
    5         5           9      5   53588.197759
    6         5          10     30    9395.539167
    7         5           2      3   78005.262249
    8         5           3      2   92907.875394
    9         5           6      1  118315.383597
    10        5           5     50    5600.243276