I have a data frame looking like this:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'match_id': ['m1', 'm1', 'm1', 'm1', 'm2', 'm2', 'm2', 'm2', 'm3', 'm3', 'm3', 'm3'],
...: 'name':['peter', 'mike', 'jeff', 'john', 'alex', 'joe', 'jeff', 'peter', 'alex', 'peter', '
...: joe', 'tom' ],
...: 'rank': [2, 3, 1, 4, 3, 1, 2, 4, 4, 3, 1, 2],
...: 'rating': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]})
In [3]: df
Out[3]:
match_id name rank rating
0 m1 peter 2 100
1 m1 mike 3 100
2 m1 jeff 1 100
3 m1 john 4 100
4 m2 alex 3 100
5 m2 joe 1 100
6 m2 jeff 2 100
7 m2 peter 4 100
8 m3 alex 4 100
9 m3 peter 3 100
10 m3 joe 1 100
11 m3 tom 2 100
It is about three matches with unique "match_id"s, the participants' names, their rank at the end of the match, and a default rating score manually set to 100 for the whole data frame.
I want to group the data based on "match_id"s and run a function for each match separately, but the function's output should be used for updating a column for next matches.
I want to use a function that calculates the players updated ratings after each match and puts it in a new column named "updated_rating". The function that I tired looks like this for the first match:
df = df.loc[df['match_id'] == 'm1']
N = len(df)
df['win_prob'] = 0.0
for i in range(N):
for j in range(N):
if i != j:
df['S'] = (N - df['rank']) / ((N*(N-1))/2)
df['win_prob'][i] += (1 / (1 + (10 ** ((df['rating'][i] - df['rating'][j])/400))))
df['normalized_win_prob'] = df['win_prob']/(N*(N-1)/2)
df['updated_rating'] = round(df['rating'] + (20 * (df['S'] - df['normalized_win_prob'])), 1)
This will do the trick for the first match and calculates the updated ratings as well as the probability of winning for each player based on their original ratings. However, I could not extend this to account for the following matches.
Since some players reappeared in the next matches, I want to update their ratings (based on calculated "updated_rating" column in the previous phase) and let the function do the job for the second match and for the third match after that.
So, for example, the output after the calculations for the first match would look like this:
match_id name rank rating win_prob S normalized_win_prob updated_rating
0 m1 peter 2 100 1.5 0.333333 0.25 101.7
1 m1 mike 3 100 1.5 0.166667 0.25 98.3
2 m1 jeff 1 100 1.5 0.500000 0.25 105.0
3 m1 john 4 100 1.5 0.000000 0.25 95.0
Any ideas on how to do this in an efficient way? My original data frame is way larger than this sample data frame, so my solution needs to be efficient.
Thanks
Here will my solution. Since your algorithm has to loop through the match_ids one by one, so we need the for-loop
on grouped data first. Then to compute the win_prob
, you have to go through each row and compute its related probability to win against other rows in the same match. It's not pretty. Can't think of a better way though :(
df = pd.DataFrame({'match_id': ['m1', 'm1', 'm1', 'm1', 'm2', 'm2', 'm2', 'm2', 'm3', 'm3', 'm3', 'm3', 'm4', 'm4', 'm4', 'm4'],
'name':['peter', 'mike', 'jeff', 'john', 'alex', 'joe', 'jeff', 'peter', 'alex', 'peter', 'joe', 'tom', 'mike', 'john', 'tom', 'peter'],
'rank': [2, 3, 1, 4, 3, 1, 2, 4, 4, 3, 1, 2, 1, 3, 4, 2],
'rating': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]})
# Pre-compute variables that don't depend on ratings
df['N'] = df.groupby('match_id')['name'].transform('count')
df['total_comb'] = ((df['N']*(df['N']-1))/2)
df['S'] = (df['N'] - df['rank']) / df['total_comb']
# Initialize win_prob and updated_rating
df['win_prob'] = np.zeros(len(df))
df['updated_rating'] = df['rating']
df['prev_rating'] = df['rating']
grouped = df.groupby('match_id', sort=True)
dfa = pd.DataFrame() #Final results will be stored here
last_names = []
#Loop through the match_ids from m1 to m2, m3. Note you can sort them when use 'groupby'
for name, dfg in grouped:
dfm = dfg.copy()
# Update the 'updated_rating' coming from last match_id
if len(last_names) > 0:
dfm.drop(columns=['updated_rating'], inplace=True)
df_last = dfa.loc[dfa['match_id'].isin(last_names),['name', 'updated_rating']]
df_last.drop_duplicates(subset=['name'], keep='last', inplace=True)
dfm = dfm.merge(df_last, left_on='name', right_on='name', how='left')
dfm['prev_rating'] = np.where(np.isnan(dfm['updated_rating']), dfm['rating'], dfm['updated_rating'])
# Compute current 'updated_rating'
win_prob = []
for index, row in dfm.iterrows():
prob = np.sum(1.0/(1+10**((row['prev_rating'] - dfm['prev_rating'])/400)))-0.5 #subtract 0.5 to account for self
win_prob.append(prob)
dfm['win_prob'] = win_prob
dfm['normalized_win_prob'] = dfm['win_prob']/dfm['total_comb']
dfm['updated_rating'] = round(dfm['prev_rating'] + (20 * (dfm['S'] - dfm['normalized_win_prob'])), 1)
last_names.append(name)
dfa = pd.concat([dfa, dfm], sort=True)
dfa
The output:
N S match_id name normalized_win_prob prev_rating rank rating total_comb updated_rating win_prob
4 0.333333333 m1 peter 0.25 100 2 100 6 101.7 1.5
4 0.166666667 m1 mike 0.25 100 3 100 6 98.3 1.5
4 0.5 m1 jeff 0.25 100 1 100 6 105 1.5
4 0 m1 john 0.25 100 4 100 6 95 1.5
4 0.166666667 m2 alex 0.251606926 100 3 100 6 98.3 1.509641559
4 0.5 m2 joe 0.251606926 100 1 100 6 105 1.509641559
4 0.333333333 m2 jeff 0.24681015 105 2 100 6 106.7 1.480860898
4 0 m2 peter 0.249975997 101.7 4 100 6 96.7 1.499855985
4 0 m3 alex 0.251630798 98.3 4 100 6 93.3 1.509784788
4 0.166666667 m3 peter 0.253165649 96.7 3 100 6 95 1.518993896
4 0.5 m3 joe 0.245203608 105 1 100 6 110.1 1.47122165
4 0.333333333 m3 tom 0.249999944 100 2 100 6 101.7 1.499999666
4 0.5 m4 mike 0.249232493 98.3 1 100 6 103.3 1.495394959
4 0.166666667 m4 john 0.252398303 95 3 100 6 93.3 1.514389819
4 0 m4 tom 0.2459709 101.7 4 100 6 96.8 1.475825403
4 0.333333333 m4 peter 0.252398303 95 2 100 6 96.6 1.514389819