Search code examples
pythondataframejoinpyspark

Difference between two syntaxes of join (pyspark)


I have a question about the difference two syntaxes of join in PySpark,

For example, I have these two dataframes

DF1

ID NAME
1 OLIVER
2 MICHAEL
3 JUSTIN

DF2

ID address
1 BRAZIL
1 FRANCE
2 PORTUGAL
2 ARGENTINA
3 MEXICO

Usually I use:

df1.join(df2, (col('df1.ID') == col('df2.ID')), 'left')

But sometimes I see people using:

df1.join(df2, 'ID', 'left')

So, my question is, what is the difference between this two codes?

I'm trying to understand the difference between the first and second syntax of join.


Solution

  • All of the following mean the same thing:

    Join Desc Type of second arg on
    df1.join(df2, 'ID') a string for the join column name str
    df1.join(df2, ['ID']) a list of column names List[str]
    df1.join(df2, col('df1.ID') == col('df2.ID')) a join expression (Column) pyspark.sql.column.Column
    df1.join(df2, df1.ID == df2.ID) a join expression (Column) pyspark.sql.column.Column

    If you had a composite key of two columns (ck1, ck2) instead of single ID. All of the following mean the same thing:

    Join Desc Type of second arg on
    df1.join(df2, ['ck1', 'ck1']) a list of column names List[str]
    df1.join(df2, [col('df1.pk1') == col('df2.pk1'), col('df1.pk2') == col('df2.pk2')]) list of join expressions List[pyspark.sql.column.Column]
    df1.join(df2, [df1.ck1 == df2.ck1, df1.ck2 == df2.ck2]) list of join expressions List[pyspark.sql.column.Column]

    Ignoring the how='left' as it's same


    As documentation of DataFrame.join() says:

    DataFrame.join(other: pyspark.sql.dataframe.DataFrame, 
                   on: Union[
                           str,
                           List[str],
                           pyspark.sql.column.Column,
                           List[pyspark.sql.column.Column],
                           None
                        ] = None,
                   how: Optional[str] = None
                ) → pyspark.sql.dataframe.DataFrame[source]
    

    on: str, list or Column, optional

    • a string for the join column name,
    • a list of column names,
    • a join expression (Column), or
    • a list of Columns.

    If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and >this performs an equi-join.