I have 2 dataframes in PySpark,
df1 = spark.createDataFrame([
("s1", "artist1"),
("s2", "artist2"),
("s3", "artist3"),
],
['song_id', 'artist'])
df1.show()
df2 = spark.createDataFrame([
("s1", "2"),
("s1", "3"),
("s4", "4"),
("s4", "5")
],
['song_id', 'duration'])
df2.show()
Output:
+-------+-------+
|song_id| artist|
+-------+-------+
| s1|artist1|
| s2|artist2|
| s3|artist3|
+-------+-------+
+-------+-----+
|song_id|col_2|
+-------+-----+
| s1| hmm|
| s1| hmmm|
| s4| acha|
| s4| ohoo|
+-------+-----+
I apply right_outer and left join on these 2 dataframes and they both seem to give me the same result-
df1.join(df2, on="song_id", how="right_outer").show()
df2.join(df1, on="song_id", how="left").show()
Output:
+-------+-------+--------+
|song_id| artist|duration|
+-------+-------+--------+
| s1|artist1| 2|
| s1|artist1| 3|
| s4| null| 4|
| s4| null| 5|
+-------+-------+--------+
+-------+--------+-------+
|song_id|duration| artist|
+-------+--------+-------+
| s1| 2|artist1|
| s1| 3|artist1|
| s4| 4| null|
| s4| 5| null|
+-------+--------+-------+
I am not sure how to use these 2 joins effectively. What is the difference between these 2 joins?
The left and right joins gives result based on the order of table respective to join keyword.
Left/leftouter/left_outer joins are all same and shows the whole left table and the matching records of the right table.
Right/rightouter/right_outer joins are all same and shows the whole right table and the matching records of the left table.
In the code
df1.join(df2, on="song_id", how="right_outer").show()
df1 is the left table(dataframe) and df2 is the right table and join type is right_outer, hence it shows all the rows of df2 and matching rows of the df1.
Similarly in
df2.join(df1, on="song_id", how="left").show()
df2 is the left table and df1 is the right table and the join type is left, so it shows all records of df2 and matching records of df1.
Hence both code shows the same result.
df1.join(df2, on="song_id", how="right_outer").show()
df1.join(df2, on="song_id", how="left").show()
In the above code, I have placed df1 as left table in both queries. And here is the result:-
song_id | artist | duration |
---|---|---|
s1 | artist1 | 2 |
s1 | artist1 | 3 |
s4 | null | 4 |
s4 | null | 5 |
song_id | artist | duration |
---|---|---|
s1 | artist1 | 2 |
s1 | artist1 | 3 |
s2 | artist2 | null |
s3 | artist3 | null |
You can use this for reference.