Tags: pyspark, left-join, outer-join

Are left join and right_outer join the same if the tables are positioned differently, in PySpark?


I have two DataFrames in PySpark:

df1 = spark.createDataFrame([
    ("s1", "artist1"),
    ("s2", "artist2"),
    ("s3", "artist3"),
    ],
    ['song_id', 'artist'])


df1.show()

df2 = spark.createDataFrame([
    ("s1", "2"),
    ("s1", "3"),
    ("s4", "4"),
    ("s4", "5")
    ],
    ['song_id', 'duration'])

df2.show()

Output:

+-------+-------+
|song_id| artist|
+-------+-------+
|     s1|artist1|
|     s2|artist2|
|     s3|artist3|
+-------+-------+



+-------+--------+
|song_id|duration|
+-------+--------+
|     s1|       2|
|     s1|       3|
|     s4|       4|
|     s4|       5|
+-------+--------+

I apply right_outer and left joins to these two DataFrames, and they both seem to give me the same result:

df1.join(df2, on="song_id", how="right_outer").show()
df2.join(df1, on="song_id", how="left").show()

Output:

+-------+-------+--------+
|song_id| artist|duration|
+-------+-------+--------+
|     s1|artist1|       2|
|     s1|artist1|       3|
|     s4|   null|       4|
|     s4|   null|       5|
+-------+-------+--------+

+-------+--------+-------+
|song_id|duration| artist|
+-------+--------+-------+
|     s1|       2|artist1|
|     s1|       3|artist1|
|     s4|       4|   null|
|     s4|       5|   null|
+-------+--------+-------+

I am not sure how to use these two joins effectively. What is the difference between them?


Solution

  • Left and right joins produce their results based on each table's position relative to the join keyword.

    Left/leftouter/left_outer joins are all the same: they return every row of the left table plus the matching rows of the right table.

    Right/rightouter/right_outer joins are all the same: they return every row of the right table plus the matching rows of the left table.
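
    As a quick sanity check (a minimal sketch using the example DataFrames above), you can confirm that the three left-join aliases return identical rows:

    # Collect each result as a set of tuples so row order doesn't
    # matter (fine here, since the example has no duplicate rows)
    def rows(df):
        return {tuple(r) for r in df.collect()}

    assert rows(df1.join(df2, on="song_id", how="left")) \
        == rows(df1.join(df2, on="song_id", how="leftouter")) \
        == rows(df1.join(df2, on="song_id", how="left_outer"))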

    In the code

    df1.join(df2, on="song_id", how="right_outer").show()
    

    df1 is the left table (DataFrame), df2 is the right table, and the join type is right_outer, so the result shows all rows of df2 and the matching rows of df1.

    Similarly in

    df2.join(df1, on="song_id", how="left").show()
    

    df2 is the left table, df1 is the right table, and the join type is left, so the result shows all rows of df2 and the matching rows of df1.

    Hence both queries produce the same result.
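
    You can verify this programmatically (a minimal sketch; the two results list the non-key columns in different orders, so select a common column order before comparing):

    right = df1.join(df2, on="song_id", how="right_outer") \
               .select("song_id", "artist", "duration")
    left = df2.join(df1, on="song_id", how="left") \
              .select("song_id", "artist", "duration")

    # Compare as sorted lists of tuples; sorting by str() avoids a
    # TypeError when None (null) and string values would be compared
    assert sorted(map(tuple, right.collect()), key=str) \
        == sorted(map(tuple, left.collect()), key=str)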

    df1.join(df2, on="song_id", how="right_outer").show()
    df1.join(df2, on="song_id", how="left").show()
    

    In the above code, df1 is the left table in both queries. Here is the result:

    +-------+-------+--------+
    |song_id| artist|duration|
    +-------+-------+--------+
    |     s1|artist1|       2|
    |     s1|artist1|       3|
    |     s4|   null|       4|
    |     s4|   null|       5|
    +-------+-------+--------+

    +-------+-------+--------+
    |song_id| artist|duration|
    +-------+-------+--------+
    |     s1|artist1|       2|
    |     s1|artist1|       3|
    |     s2|artist2|    null|
    |     s3|artist3|    null|
    +-------+-------+--------+

    For reference, see the PySpark documentation:
    https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.join.html#pyspark.sql.DataFrame.join