Search code examples
pysparkdatabricksazure-databricks

Chaining joins in Pyspark


I am trying to join multiple dataframes in PySpark by one chained operation. The join key column name is the same in all of them. The code snippet:

columns_summed = [i for i in df_summed.columns if i != "buildingBlock_id"]
columns_concat = [i for i in df_concat.columns if i != "buildingBlock_id"]
columns_indicator = [i for i in df_indicator_fields.columns if i != "buildingBlock_id"]
columns_takeone = [i for i in df_takeone.columns if i != "buildingBlock_id"]
columns_minmax = [i for i in df_minmax.columns if i != "buildingBlock_id"]

df_all_joined = (df_summed.alias("df1").join(df_concat,df_summed.buildingBlock_id == df_concat.buildingBlock_id, "left")
                          .join(df_indicator_fields,df_summed.buildingBlock_id == df_indicator_fields.buildingBlock_id, "left")
                          .join(df_takeone,df_summed.buildingBlock_id == df_takeone.buildingBlock_id, "left")
                          .join(df_minmax,df_summed.buildingBlock_id == df_minmax.buildingBlock_id, "left")
                          .select("df1.buildingBlock_id", *columns_summed
                          , *columns_concat
                          , *columns_indicator
                          , *columns_takeone
                          , *columns_minmax
                          )
                          )

Now, when I am trying to display the joined dataframe using:

display(df_all_joined)

I'm getting the following error:

AnalysisException: Reference 'df1.buildingBlock_id' is ambiguous, could be: df1.buildingBlock_id, df1.buildingBlock_id.

Why am I getting this error even though I specified where the key column should come from?


Solution

  • You should specify the join columns as an array of strings:

    .join(df_concat,['buildingBlock_id'], "left")
    

    If the columns that you are joining on have the same name, this makes sure to drop one of them. In the case of the left join, it drops the column from df_concat.

    If you don't do that, you end up with both columns in the joined data frame, thus creating this "Ambiguous" excception.