python pandas dataframe intersection

How to merge two columns by the intersection of the elements in each column?


Imagine I have a dataframe like this, where each cell holds a list of elements as a single comma-separated string:

import pandas as pd

data = {'Col1': ["apple, banana, orange", "dog, cat", "python, java, c++"],
        'Col2': ["banana, lemon, blueberry", "bird, cat", "R, fortran"]
       }
df = pd.DataFrame(data)
df

How can I create a Col3 with the intersection of the elements in Col1 and Col2?

Expected output:

data = {'Col1': ["apple, banana, orange", "dog, cat", "python, java, c++"],
        'Col2': ["banana, lemon, blueberry", "bird, cat", "R, fortran"],
        'Col3': ["banana", "cat", pd.NA]
       }
df = pd.DataFrame(data)
df

Solution

  • Using a list comprehension and set intersection:

    df['Col3'] = [', '.join(set(a.split(', ')) & set(b.split(', ')))
                  for a,b in zip(df['Col1'], df['Col2'])]
    

    Output:

                        Col1                      Col2    Col3
    0  apple, banana, orange  banana, lemon, blueberry  banana
    1               dog, cat                 bird, cat     cat
    2      python, java, c++                R, fortran        
    

    If you want NA instead of an empty string on empty intersections (the assignment expression := requires Python 3.8+):

    df['Col3'] = [x if (x:=', '.join(set(a.split(', ')) & set(b.split(', '))))
                  else pd.NA
                  for a,b in zip(df['Col1'], df['Col2'])]
    

    Output:

                        Col1                      Col2    Col3
    0  apple, banana, orange  banana, lemon, blueberry  banana
    1               dog, cat                 bird, cat     cat
    2      python, java, c++                R, fortran    <NA>
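
One caveat with the set-based approach above: a Python set has no defined order, so when a row shares more than one element, the order of the joined result is not guaranteed. A minimal sketch of an order-preserving variant follows. The helper name `ordered_intersection` and its `sep` parameter are hypothetical, not part of pandas, and the first row of Col2 is changed from the question's data so that the row has two shared elements. It also assumes that stray whitespace around elements should be ignored and that a missing cell should yield NA.

```python
import pandas as pd

def ordered_intersection(a, b, sep=","):
    """Return elements of `a` that also occur in `b`, in the order they appear in `a`.

    Hypothetical helper for illustration; not a pandas function.
    """
    # A missing cell (None, NaN, pd.NA) has nothing to intersect.
    if pd.isna(a) or pd.isna(b):
        return pd.NA
    # Strip whitespace so "a,b" and "a, b" yield the same elements.
    right = {item.strip() for item in b.split(sep)}
    seen = set()
    common = []
    for item in (part.strip() for part in a.split(sep)):
        # Keep non-empty shared elements once, in first-seen order.
        if item and item in right and item not in seen:
            seen.add(item)
            common.append(item)
    return ", ".join(common) if common else pd.NA

# Sample data modified from the question: row 0 now shares two elements.
df = pd.DataFrame({
    "Col1": ["apple, banana, orange", "dog, cat", "python, java, c++"],
    "Col2": ["banana, lemon, apple", "bird, cat", "R, fortran"],
})
df["Col3"] = [ordered_intersection(a, b) for a, b in zip(df["Col1"], df["Col2"])]
print(df)
```

With this variant the first row always comes out as "apple, banana", following the order in Col1, whereas the set-based one-liner may return the two elements in either order.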