Search code examples
pythonpandasdataframepandas-groupbyupdating

Grouping by unique IDs, applying a function, and updating a certain column for next groups


I have a data frame looking like this:


In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'match_id': ['m1', 'm1', 'm1', 'm1', 'm2', 'm2', 'm2', 'm2', 'm3', 'm3', 'm3', 'm3'],
   ...:                     'name':['peter', 'mike', 'jeff', 'john', 'alex', 'joe', 'jeff', 'peter', 'alex', 'peter', '
   ...: joe', 'tom' ],
   ...:                     'rank': [2, 3, 1, 4, 3, 1, 2, 4, 4, 3, 1, 2],
   ...:                     'rating': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]})

In [3]: df
Out[3]:
    match_id    name    rank  rating
0          m1  peter     2     100
1          m1   mike     3     100
2          m1   jeff     1     100
3          m1   john     4     100
4          m2   alex     3     100
5          m2    joe     1     100
6          m2   jeff     2     100
7          m2  peter     4     100
8          m3   alex     4     100
9          m3  peter     3     100
10         m3    joe     1     100
11         m3    tom     2     100

It is about three matches with unique "match_id"s, the participants' names, their rank at the end of the match, and a default rating score manually set to 100 for the whole data frame.

I want to group the data based on "match_id"s and run a function for each match separately, but the function's output should be used for updating a column for next matches.

I want to use a function that calculates the players updated ratings after each match and puts it in a new column named "updated_rating". The function that I tired looks like this for the first match:

df = df.loc[df['match_id'] == 'm1']
N = len(df)
df['win_prob'] = 0.0
for i in range(N):
    for j in range(N):
        if i != j:
            df['S'] = (N - df['rank']) / ((N*(N-1))/2)
            df['win_prob'][i] += (1 / (1 + (10 ** ((df['rating'][i] - df['rating'][j])/400))))
            df['normalized_win_prob'] = df['win_prob']/(N*(N-1)/2)
            df['updated_rating'] = round(df['rating'] + (20 * (df['S'] - df['normalized_win_prob'])), 1)

This will do the trick for the first match and calculates the updated ratings as well as the probability of winning for each player based on their original ratings. However, I could not extend this to account for the following matches.

Since some players reappeared in the next matches, I want to update their ratings (based on calculated "updated_rating" column in the previous phase) and let the function do the job for the second match and for the third match after that.

So, for example, the output after the calculations for the first match would look like this:


match_id name rank rating  win_prob    S    normalized_win_prob  updated_rating
0   m1  peter   2   100     1.5     0.333333          0.25            101.7
1   m1  mike    3   100     1.5     0.166667          0.25             98.3
2   m1  jeff    1   100     1.5     0.500000          0.25            105.0
3   m1  john    4   100     1.5     0.000000          0.25             95.0

Any ideas on how to do this in an efficient way? My original data frame is way larger than this sample data frame, so my solution needs to be efficient.

Thanks


Solution

  • Here will my solution. Since your algorithm has to loop through the match_ids one by one, so we need the for-loop on grouped data first. Then to compute the win_prob, you have to go through each row and compute its related probability to win against other rows in the same match. It's not pretty. Can't think of a better way though :(

    df = pd.DataFrame({'match_id': ['m1', 'm1', 'm1', 'm1', 'm2', 'm2', 'm2', 'm2', 'm3', 'm3', 'm3', 'm3', 'm4', 'm4', 'm4', 'm4'],
                       'name':['peter', 'mike', 'jeff', 'john', 'alex', 'joe', 'jeff', 'peter', 'alex', 'peter', 'joe', 'tom', 'mike', 'john', 'tom', 'peter'],
                       'rank': [2, 3, 1, 4, 3, 1, 2, 4, 4, 3, 1, 2, 1, 3, 4, 2],
                       'rating': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]})
    
    # Pre-compute variables that don't depend on ratings
    df['N'] = df.groupby('match_id')['name'].transform('count')
    df['total_comb'] = ((df['N']*(df['N']-1))/2)
    df['S'] = (df['N'] - df['rank']) / df['total_comb']
    
    # Initialize win_prob and updated_rating
    df['win_prob'] = np.zeros(len(df))
    df['updated_rating'] = df['rating']
    df['prev_rating'] = df['rating']
    
    grouped = df.groupby('match_id', sort=True)
    
    dfa = pd.DataFrame() #Final results will be stored here
    last_names = []
    #Loop through the match_ids from m1 to m2, m3. Note you can sort them when use 'groupby'
    for name, dfg in grouped:
        dfm = dfg.copy()
        # Update the 'updated_rating' coming from last match_id
        if len(last_names) > 0:
            dfm.drop(columns=['updated_rating'], inplace=True)
            df_last = dfa.loc[dfa['match_id'].isin(last_names),['name', 'updated_rating']]
            df_last.drop_duplicates(subset=['name'], keep='last', inplace=True)
            dfm = dfm.merge(df_last, left_on='name', right_on='name', how='left')
            dfm['prev_rating'] = np.where(np.isnan(dfm['updated_rating']), dfm['rating'], dfm['updated_rating'])
    
        # Compute current 'updated_rating'
        win_prob = []
        for index, row in dfm.iterrows():
            prob = np.sum(1.0/(1+10**((row['prev_rating'] - dfm['prev_rating'])/400)))-0.5 #subtract 0.5 to account for self
            win_prob.append(prob)
    
        dfm['win_prob'] = win_prob
        dfm['normalized_win_prob'] = dfm['win_prob']/dfm['total_comb']
        dfm['updated_rating'] = round(dfm['prev_rating'] + (20 * (dfm['S'] - dfm['normalized_win_prob'])), 1)
        last_names.append(name)
        dfa = pd.concat([dfa, dfm], sort=True)
    
    dfa   
    
    

    The output:

    N       S         match_id  name    normalized_win_prob prev_rating   rank  rating  total_comb  updated_rating  win_prob
    4   0.333333333     m1      peter   0.25                    100         2   100     6               101.7       1.5
    4   0.166666667     m1      mike    0.25                    100         3   100     6               98.3        1.5
    4   0.5             m1      jeff    0.25                    100         1   100     6               105         1.5
    4   0               m1      john    0.25                    100         4   100     6               95          1.5
    4   0.166666667     m2      alex    0.251606926             100         3   100     6               98.3        1.509641559
    4   0.5             m2      joe     0.251606926             100         1   100     6               105         1.509641559
    4   0.333333333     m2      jeff    0.24681015              105         2   100     6               106.7       1.480860898
    4   0               m2      peter   0.249975997             101.7       4   100     6               96.7        1.499855985
    4   0               m3      alex    0.251630798             98.3        4   100     6               93.3        1.509784788
    4   0.166666667     m3      peter   0.253165649             96.7        3   100     6               95          1.518993896
    4   0.5             m3      joe     0.245203608             105         1   100     6               110.1       1.47122165
    4   0.333333333     m3      tom     0.249999944             100         2   100     6               101.7       1.499999666
    4   0.5             m4      mike    0.249232493             98.3        1   100     6               103.3       1.495394959
    4   0.166666667     m4      john    0.252398303             95          3   100     6               93.3        1.514389819
    4   0               m4      tom     0.2459709               101.7       4   100     6               96.8        1.475825403
    4   0.333333333     m4      peter   0.252398303             95          2   100     6               96.6        1.514389819