Search code examples
pythonpandaspandasql

Outer Join Pandas Dataframe


I am trying to outer join (on df1) two pandas dataframe. Below are the sample data frames:

df1:
Index   Team 1   Team 2   Team1_Score    Team2_Score
 0       A        B        25              56
 1       B        C        30              55
 2       D        E        35              75

df2:
Index   Team 1   Team 2   Team1_Avg     Team2_Avg
 0       A        B        5              15
 1       G        F        10             25
 2       C        B        15             35

dfcombined
Index   Team 1   Team 2   Team1_Score    Team2_Score    Team2_Avg     Team1_Avg
 0       A        B        25              56           5             15
 1       B        C        30              55           35            15
 2       D        E        35              75        

I was trying to use the pandasql module however I am not sure how to handle the case of joining index 1 in df1 and index 2 at df2 as the order of teams is reversed. Through pandasql module, I am not sure how to switch the Team Avg values in the combined data frame if the order of team is reverse.

I would appreciate any help on this.


Solution

  • Setup -

    df1
    
          Team 1 Team 2  Team1_Score  Team2_Score
    Index                                        
    0          A      B           25           56
    1          B      C           30           55
    2          D      E           35           75
    
    df2
    
          Team 1 Team 2  Team1_Avg  Team2_Avg
    Index                                    
    0          A      B          5         15
    1          F      G         25         10
    2          B      C         35         15
    

    First, we'll need to sort the Team * columns, and accordingly sort the Team*_Score columns in the same way. We'll use argsort to do so.

    i = np.arange(len(df1))[:, None]
    j = np.argsort(df1[['Team 1', 'Team 2']], axis=1).values
    
    df1[['Team 1', 'Team 2']] = df1[['Team 1', 'Team 2']].values[i, j]
    df1[['Team1_Score', 'Team2_Score']] = df1[['Team1_Score', 'Team2_Score']].values[i, j]
    

    Now, repeat the same process for df2, with Team * and Team*_Avg.

    j = np.argsort(df2[['Team 1', 'Team 2']], axis=1).values
    
    df2[['Team 1', 'Team 2']] = df2[['Team 1', 'Team 2']].values[i, j]
    df2[['Team1_Avg', 'Team2_Avg']] = df2[['Team1_Avg', 'Team2_Avg']].values[i, j]
    

    Now, perform a left outer merge -

    df1.merge(df2, on=['Team 1', 'Team 2'], how='left')
    
      Team 1 Team 2  Team1_Score  Team2_Score Team1_Avg Team2_Avg
    0      A      B           25           56         5        15
    1      B      C           30           55        35        15
    2      D      E           35           75