I have a question about the difference two syntaxes of join in PySpark,
For example, I have these two dataframes
DF1
ID | NAME |
---|---|
1 | OLIVER |
2 | MICHAEL |
3 | JUSTIN |
DF2
ID | address |
---|---|
1 | BRAZIL |
1 | FRANCE |
2 | PORTUGAL |
2 | ARGENTINA |
3 | MEXICO |
Usually I use:
df1.join(df2, (col('df1.ID') == col('df2.ID')), 'left')
But sometimes I see people using:
df1.join(df2, 'ID', 'left')
So, my question is, what is the difference between this two codes?
I'm trying to understand the difference between the first and second syntax of join.
All of the following mean the same thing:
Join | Desc | Type of second arg on |
---|---|---|
df1.join(df2, 'ID') |
a string for the join column name | str |
df1.join(df2, ['ID']) |
a list of column names | List[str] |
df1.join(df2, col('df1.ID') == col('df2.ID')) |
a join expression (Column) | pyspark.sql.column.Column |
df1.join(df2, df1.ID == df2.ID) |
a join expression (Column) | pyspark.sql.column.Column |
If you had a composite key of two columns (ck1, ck2
) instead of single ID
. All of the following mean the same thing:
Join | Desc | Type of second arg on |
---|---|---|
df1.join(df2, ['ck1', 'ck1']) |
a list of column names | List[str] |
df1.join(df2, [col('df1.pk1') == col('df2.pk1'), col('df1.pk2') == col('df2.pk2')]) |
list of join expressions | List[pyspark.sql.column.Column] |
df1.join(df2, [df1.ck1 == df2.ck1, df1.ck2 == df2.ck2]) |
list of join expressions | List[pyspark.sql.column.Column] |
Ignoring the how='left'
as it's same
As documentation of DataFrame.join()
says:
DataFrame.join(other: pyspark.sql.dataframe.DataFrame,
on: Union[
str,
List[str],
pyspark.sql.column.Column,
List[pyspark.sql.column.Column],
None
] = None,
how: Optional[str] = None
) → pyspark.sql.dataframe.DataFrame[source]
on
:str
,list
orColumn
, optional
- a string for the join column name,
- a list of column names,
- a join expression (Column), or
- a list of Columns.
If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and >this performs an equi-join.