python, scala, apache-spark, join, inner-join

How to join a column of lists in one dataframe with a column of strings in another dataframe?


I have two dataframes. The first one (let's call it A) has a column (let's call it 'col1') whose elements are lists of strings. The other one (let's call it B) has a column (let's call it 'col2') whose elements are strings. I want to do a join between these two dataframes where B.col2 is contained in the list in A.col1. This is a one-to-many join.

Also, I need the solution to be scalable, since I want to join two dataframes with hundreds of thousands of rows.

I have tried concatenating the values in A.col1 into a new column (let's call it 'col3') and joining with the condition A.col3.contains(B.col2). However, my understanding is that this condition triggers a cartesian product between the two dataframes, which I cannot afford given the size of the dataframes.

from pyspark.sql.functions import udf

def joinIds(IdList):
    return "__".join(IdList)

joinIds_udf = udf(joinIds)

# Concatenate the ids into one string, then join on a contains() condition.
pnr_corr = pnr_corr.withColumn('joinedIds', joinIds_udf(pnr_corr.pnrCorrelations.correlationPnrSchedule.scheduleIds))

pnr_corr_skd = pnr_corr.join(skd, pnr_corr.joinedIds.contains(skd.id), how='inner')

This is a sample of the join that I have in mind:

dataframe A:
listColumn
["a","b","c"]
["a","b"]
["d","e"]

dataframe B:
valueColumn
a
b
d

output:
listColumn      valueColumn
["a","b","c"]   a
["a","b","c"]   b
["a","b"]       a
["a","b"]       b
["d","e"]       d

Solution

  • I don't know if there is an efficient way to do it, but this gives the correct output:

    import pandas as pd
    from itertools import chain

    df1 = pd.Series([["a","b","c"],["a","b"],["d","e"]])
    df2 = pd.Series(["a","b","d"])

    # For every list in df1, collect the values of df2 that appear in it,
    # producing (value, list) pairs, then flatten the nested result.
    result = [ [ [el2,list1] for el2 in df2.values if el2 in list1 ]
                             for list1 in df1.values ]
    result_flat = list(chain(*result))

    result_df = pd.DataFrame(result_flat)
    

    You get:

    In [26]: result_df
    Out[26]:
       0          1
    0  a  [a, b, c]
    1  b  [a, b, c]
    2  a     [a, b]
    3  b     [a, b]
    4  d     [d, e]
    

    Another approach is to use the explode() method, available since pandas>=0.25, and merge like this:

    import pandas as pd

    df1 = pd.DataFrame({'col1': [["a","b","c"],["a","b"],["d","e"]]})
    df2 = pd.DataFrame({'col2': ["a","b","d"]})

    # Explode the list column into one row per element, keeping the original
    # row index so the full lists can be recovered after the merge.
    df1_flat = df1.col1.explode().reset_index()
    df_merged = pd.merge(df1_flat, df2, left_on='col1', right_on='col2')

    # Replace the matched value in 'col2' with the original list from df1.
    df_merged['col2'] = df1.loc[df_merged['index'], 'col1'].values
    df_merged.drop('index', axis=1, inplace=True)

    This gives the same result:

      col1       col2
    0    a  [a, b, c]
    1    a     [a, b]
    2    b  [a, b, c]
    3    b     [a, b]
    4    d     [d, e]
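
    Since the question is about Spark dataframes, the same explode-then-join idea also translates to PySpark: exploding the array column and doing an ordinary inner equi-join lets Spark use a hash join instead of the cartesian product that a contains() condition would trigger. Below is only a minimal sketch assuming the sample data above; the names df_a, df_b and element are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode

    spark = SparkSession.builder.getOrCreate()

    # Illustrative dataframes mirroring the sample above.
    df_a = spark.createDataFrame([(["a","b","c"],), (["a","b"],), (["d","e"],)],
                                 ["listColumn"])
    df_b = spark.createDataFrame([("a",), ("b",), ("d",)], ["valueColumn"])

    # One row per list element, then an ordinary inner equi-join.
    df_a_flat = df_a.withColumn("element", explode(df_a.listColumn))
    result = (df_a_flat
              .join(df_b, df_a_flat.element == df_b.valueColumn, how='inner')
              .select("listColumn", "valueColumn"))

    result.show()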