
PySpark Which Join Should be Selected


I have two PySpark dataframes:

Dataframe 1

ID Account
1 A
2 A
3 B
4 B

Dataframe 2

Country Account
USA A
UK A
USA B
UK B

I would like to join the two dataframes on Account, so that each ID in dataframe 1 produces one row per matching Country, like:

ID Country Account
1 USA A
2 USA A
1 UK A
2 UK A
3 USA B
4 USA B
3 UK B
4 UK B

Which join type should I choose?


Solution

  • Left, right, inner, and outer joins all produce this exact result here, because every Account value appears in both dataframes, so no rows are unmatched. The join type only matters for rows that cannot be matched between the two tables:

    Remove them: inner join

    Keep unmatched rows from both sides: outer join

    Keep all rows from df1: left join

    Keep all rows from df2: right join