Search code examples
pythonpandasdataframe

Trying to rearrange multiple columns in a dataframe based on ranking row values


I'm working on a matching company names and I have a dataframe that returns output in the format below.

The table has an original name and for each original name, there could be N number of matches. For each match, there are 3 columns, match_name_0, score_0, match_index_0 and so on up to match_name_N.

I'm trying to figure out a way to return a new dataframe that sorts the columns after the original_name by the highest match scores. Essentially, if match_score_2 was the highest then match_score_0 followed by match_score_1 the columns would be

original_score, match_name_2, match_score_2, match_index_2, match_name_0, match_score_0, match_index_0, match_name_2, match_score_2, match_index_2,

In the event of a tie, the leftmost match should be ranked higher. I should note that sometimes they will be in the correct order but 30-40% of the times, they are not.

index original_name match_name_0 score_0 match_index_0 match_name_1 score_1 match_index_1 match_name_2 score_2 match_index_2 match_name_3 score_3 match_index_3 match_name_4 score_4 match_index_4
0 aberdeen asset management plc aberdeen asset management sa 100 2114 aberdeen asset management plc esop 100 2128 aberdeen asset management inc 100 2123 aberdeen asset management spain 71.18779356 2132 aberdeen asset management ireland 69.50514818 2125
2 agi partners llc agi partners llc 100 5274 agi partners llc 100 5273 agr partners llc 57.51100704 5378 aci partners llc 53.45090217 3097 avi partners llc 53.45090217 17630
3 alberta investment management corporation alberta investment management corporation 100 6754 alberta investment management corporation pension arm 100 6755 anchor investment management corporation 17.50748486 10682 cbc investment management corporation 11.79760839 36951 harvest investment management corporation 31.70316571 85547

Solution

  • I am assuming you want to impose the ordering of matches first by score and then by match_number individually for each original_name.

    Wide datasets are usually difficult to deal with, including this case. I suggest to reshape to a long dataset, where you can easily impose your required ordering by

    sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])
    

    Finally, you can reshape it back to a wide dataset.

    import pandas as pd
    from io import StringIO
        
    # sample data
    df = """  
    original_name,match_name_0,score_0,match_index_0,match_name_1,score_1,match_index_1,match_name_2,score_2,match_index_2,match_name_3,score_3,match_index_3,match_name_4,score_4,match_index_4
    aberdeen asset management plc,aberdeen asset management sa,100,2114,aberdeen asset management plc esop,100,2128,aberdeen asset management inc,100,2123,aberdeen asset management spain,71.18779356,2132,aberdeen asset management ireland,69.50514818,2125
    agi partners llc,agi partners llc,100,5274,agi partners llc,100,5273,agr partners llc,57.51100704,5378,aci partners llc,53.45090217,3097,avi partners llc,53.45090217,17630
    alberta investment management corporation,alberta investment management corporation,100,6754,alberta investment management corporation pension arm,100,6755,anchor investment management corporation,17.50748486,10682,cbc investment management corporation,11.79760839,36951,harvest investment management corporation,31.70316571,85547
    """
    df= pd.read_csv(StringIO(df.strip()), sep=',', engine='python')
    
    # wide to long
    result = pd.wide_to_long(df, ['match_name','score','match_index'], i='original_name', j='match_number', sep='_').reset_index()
    
    # sort matches as per requirement
    result = result.sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])
    
    # overwrite ranking imposed by previous sort
    # this ensures that the order is maintained once it is
    # reshaped back to a wide dataset
    result['match_number'] = result.groupby('original_name').cumcount()
    
    # reshape long to wide
    result = result.set_index(['original_name','match_number']).unstack()
    
    # tidy up to match expected result
    result = result.swaplevel(axis=1).sort_index(axis=1)
    result = result.reindex(['match_name','score','match_index'], axis=1, level=1)
    result.columns = [f'{col[1]}_{col[0]}' for col in result.columns]
    

    As a result, for example, previous match 4 of alberta investment management corporation is now match 2 (based on score). The order of matches 3 and 4 for agi partners llc remain the same because they have the same score.