I am trying a PySpark join:
city_df.join(regions.select("region_id"),
             on=[city_df.region_name == regions_df.name],
             how='inner')
And I'm getting an error that "name" is not found among city_df's columns. I don't understand why, since I clearly specify that on the city_df side the column should be "region_name", not "name".
The error:
pyspark.errors.exceptions.captured.AnalysisException: Resolved attribute(s) name#32 missing from type#0,region_name#1,version#6, region_id#31L in operator !Join Inner, (region_name#1 = name#32).;
!Join Inner, (region_name#1 = name#32)
So what am I doing wrong? How do I join correctly on columns with different names?
In your first line you're selecting just the region_id column. Because of this, there is no name column left to reference in the join.

Also, you appear to be referencing two different region dataframes in your join: regions and regions_df. You should be referencing the same dataframe. Something like this should work:
city_df.join(
regions_df,
on=[city_df.region_name == regions_df.name],
how='inner'
)
Then drop any unwanted columns after.