I have two PySpark DataFrames:
DataFrame 1
ID | Account |
---|---|
1 | A |
2 | A |
3 | B |
4 | B |
DataFrame 2
Country | Account |
---|---|
USA | A |
UK | A |
USA | B |
UK | B |
I would like to join the two DataFrames on Account so that each ID in DataFrame 1 is paired with every matching Country, producing a new DataFrame like:
ID | Country | Account |
---|---|---|
1 | USA | A |
2 | USA | A |
1 | UK | A |
2 | UK | A |
3 | USA | B |
4 | USA | B |
3 | UK | B |
4 | UK | B |
Which join type should I choose?
Left, right, inner, and outer joins all produce the result above here, because every Account value appears in both DataFrames. The join type only matters for rows whose key has no match in the other table:
Drop them: inner join
Keep all rows from both tables: outer join
Keep all rows from df1, drop unmatched rows from df2: left join
Keep all rows from df2, drop unmatched rows from df1: right join
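For example, a minimal sketch with an inner join, using hypothetical sample data mirroring the tables in the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the question's two tables
df1 = spark.createDataFrame(
    [(1, "A"), (2, "A"), (3, "B"), (4, "B")],
    ["ID", "Account"],
)
df2 = spark.createDataFrame(
    [("USA", "A"), ("UK", "A"), ("USA", "B"), ("UK", "B")],
    ["Country", "Account"],
)

# Inner join on Account: every ID is paired with every Country
# that shares the same Account value
result = df1.join(df2, on="Account", how="inner").select("ID", "Country", "Account")
result.show()
```

Note that because each Account appears twice in df2, every row of df1 is duplicated in the output, which is exactly the expected result shown in the question.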