I'm working on a matching company names and I have a dataframe that returns output in the format below.
The table has an original name and for each original name, there could be N number of matches. For each match, there are 3 columns, match_name_0, score_0, match_index_0 and so on up to match_name_N.
I'm trying to figure out a way to return a new dataframe that sorts the columns after the original_name by the highest match scores. Essentially, if match_score_2 was the highest then match_score_0 followed by match_score_1 the columns would be
original_score, match_name_2, match_score_2, match_index_2, match_name_0, match_score_0, match_index_0, match_name_2, match_score_2, match_index_2,
In the event of a tie, the leftmost match should be ranked higher. I should note that sometimes they will be in the correct order but 30-40% of the times, they are not.
index | original_name | match_name_0 | score_0 | match_index_0 | match_name_1 | score_1 | match_index_1 | match_name_2 | score_2 | match_index_2 | match_name_3 | score_3 | match_index_3 | match_name_4 | score_4 | match_index_4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | aberdeen asset management plc | aberdeen asset management sa | 100 | 2114 | aberdeen asset management plc esop | 100 | 2128 | aberdeen asset management inc | 100 | 2123 | aberdeen asset management spain | 71.18779356 | 2132 | aberdeen asset management ireland | 69.50514818 | 2125 |
2 | agi partners llc | agi partners llc | 100 | 5274 | agi partners llc | 100 | 5273 | agr partners llc | 57.51100704 | 5378 | aci partners llc | 53.45090217 | 3097 | avi partners llc | 53.45090217 | 17630 |
3 | alberta investment management corporation | alberta investment management corporation | 100 | 6754 | alberta investment management corporation pension arm | 100 | 6755 | anchor investment management corporation | 17.50748486 | 10682 | cbc investment management corporation | 11.79760839 | 36951 | harvest investment management corporation | 31.70316571 | 85547 |
I am assuming you want to impose the ordering of matches first by score
and then by match_number
individually for each original_name
.
Wide datasets are usually difficult to deal with, including this case. I suggest to reshape to a long dataset, where you can easily impose your required ordering by
sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])
Finally, you can reshape it back to a wide dataset.
import pandas as pd
from io import StringIO
# sample data
df = """
original_name,match_name_0,score_0,match_index_0,match_name_1,score_1,match_index_1,match_name_2,score_2,match_index_2,match_name_3,score_3,match_index_3,match_name_4,score_4,match_index_4
aberdeen asset management plc,aberdeen asset management sa,100,2114,aberdeen asset management plc esop,100,2128,aberdeen asset management inc,100,2123,aberdeen asset management spain,71.18779356,2132,aberdeen asset management ireland,69.50514818,2125
agi partners llc,agi partners llc,100,5274,agi partners llc,100,5273,agr partners llc,57.51100704,5378,aci partners llc,53.45090217,3097,avi partners llc,53.45090217,17630
alberta investment management corporation,alberta investment management corporation,100,6754,alberta investment management corporation pension arm,100,6755,anchor investment management corporation,17.50748486,10682,cbc investment management corporation,11.79760839,36951,harvest investment management corporation,31.70316571,85547
"""
df= pd.read_csv(StringIO(df.strip()), sep=',', engine='python')
# wide to long
result = pd.wide_to_long(df, ['match_name','score','match_index'], i='original_name', j='match_number', sep='_').reset_index()
# sort matches as per requirement
result = result.sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])
# overwrite ranking imposed by previous sort
# this ensures that the order is maintained once it is
# reshaped back to a wide dataset
result['match_number'] = result.groupby('original_name').cumcount()
# reshape long to wide
result = result.set_index(['original_name','match_number']).unstack()
# tidy up to match expected result
result = result.swaplevel(axis=1).sort_index(axis=1)
result = result.reindex(['match_name','score','match_index'], axis=1, level=1)
result.columns = [f'{col[1]}_{col[0]}' for col in result.columns]
As a result, for example, previous match 4 of alberta investment management corporation
is now match 2 (based on score). The order of matches 3 and 4 for agi partners llc
remain the same because they have the same score.