apache-spark, pyspark

Joining dataframes with different column names for the join


I am trying a PySpark join:

city_df.join(regions.select("region_id"),
             on=[city_df.region_name == regions_df.name],
             how='inner')

And I am getting an error that "name" is not found among the city_df columns. I don't understand why, since I clearly specify that on the city_df side it should be "region_name", not "name".

The error:

pyspark.errors.exceptions.captured.AnalysisException: Resolved attribute(s) name#32 missing from type#0,region_name#1,version#6, region_id#31L in operator !Join Inner, (region_name#1 = name#32).;
!Join Inner, (region_name#1 = name#32)

So what am I doing wrong? How do I join correctly on columns that have different names?


Solution

  • In your first line you're selecting just the region_id column. Because of this, there is no name column to reference in the join.

    Also, you appear to be referencing two different region dataframes in your join: regions and regions_df. You should be referencing the same dataframe. Something like this should work:

    city_df.join(
        regions_df,
        on=[city_df.region_name == regions_df.name],
        how='inner'
    )
    

    Then drop any unwanted columns afterwards, as in the sketch below.
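
    For completeness, here is a minimal, self-contained sketch of the fix. The sample rows and values are made up for illustration; only the column names (region_name on the city side, name on the regions side) mirror the question. The final drop call removes the now-redundant join column from the regions side.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-example").getOrCreate()

    # Hypothetical sample data mirroring the question's schemas.
    city_df = spark.createDataFrame(
        [("Berlin", "Brandenburg"), ("Munich", "Bavaria")],
        ["city", "region_name"],
    )
    regions_df = spark.createDataFrame(
        [(1, "Brandenburg"), (2, "Bavaria")],
        ["region_id", "name"],
    )

    # Join the whole regions_df (no select first), so that
    # regions_df.name is still available to the join condition.
    joined = city_df.join(
        regions_df,
        on=[city_df.region_name == regions_df.name],
        how="inner",
    )

    # Drop the duplicate "name" column from the regions side.
    result = joined.drop(regions_df.name)
    result.show()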