Search code examples
pythonpandasdataframejoin

How to do join pandas dataframes based on multiple columns?


I have 2 dataframes like below:

level                        title
Level 0                      Effective
Level 1                      Evaluation
Level 1                      Ice Breaker
Level 1                      Fire
Level 2                      Introduction
Level 2                      Understanding
Level 3                      Connect 

and

level                        title
Level 0                      Effective
Level 1                      Evaluation
Level 1                      Comedy
Level 2                      Introduction
Level 2                      Understanding
Level 4                      Connect 

I want to join them based on both columns, matching entries should be in front of each other while missing entries should have Null in them. The required output would be like this:

level               title               level               title
Level 0             Effective           Level 0             Effective
Level 1             Evaluation          Level 1             Evaluation
Level 1             Ice Breaker 
                                        Level 1             Comedy
Level 1             Fire    
Level 2             Introduction        Level 2             Introduction
Level 2             Understanding       Level 2             Understanding
Level 3             Connect             
                                        Level 4             Connect
                                        

Solution

  • You can use a merge but first differentiate the column names on the left/right (e.g. with add_suffix):

    out = df1.merge(df2.add_suffix('_'), how='outer',
                    left_on=['level', 'title'],
                    right_on=['level_', 'title_'])
    

    Output:

         level          title   level_         title_
    0  Level 0      Effective  Level 0      Effective
    1  Level 1     Evaluation  Level 1     Evaluation
    2  Level 1    Ice Breaker      NaN            NaN
    3  Level 1           Fire      NaN            NaN
    4  Level 2   Introduction  Level 2   Introduction
    5  Level 2  Understanding  Level 2  Understanding
    6  Level 3        Connect      NaN            NaN
    7      NaN            NaN  Level 1         Comedy
    8      NaN            NaN  Level 4        Connect
    

    Another approach if you want to be able to sort on the merged keys:

    out = (df1.merge(df2, how='outer',
                     left_on=[df1['level'], df1['title']],
                     right_on=['level', 'title'])
              .sort_values(by=['level'])
              #.drop(columns=['level', 'title']) # uncomment to drop merged keys
          )
    

    Output:

         level          title  level_x        title_x  level_y        title_y
    0  Level 0      Effective  Level 0      Effective  Level 0      Effective
    1  Level 1     Evaluation  Level 1     Evaluation  Level 1     Evaluation
    2  Level 1    Ice Breaker  Level 1    Ice Breaker      NaN            NaN
    3  Level 1           Fire  Level 1           Fire      NaN            NaN
    7  Level 1         Comedy      NaN            NaN  Level 1         Comedy
    4  Level 2   Introduction  Level 2   Introduction  Level 2   Introduction
    5  Level 2  Understanding  Level 2  Understanding  Level 2  Understanding
    6  Level 3        Connect  Level 3        Connect      NaN            NaN
    8  Level 4        Connect      NaN            NaN  Level 4        Connect