Tags: apache-spark, apache-spark-sql, cassandra, datastax-enterprise, spark-cassandra-connector

spark join causing column id ambiguity error


I have the following DataFrames:

accumulated_results_df
 |-- company_id: string (nullable = true)
 |-- max_dd: string (nullable = true)
 |-- min_dd: string (nullable = true)
 |-- count: string (nullable = true)
 |-- mean: string (nullable = true)

computed_df
 |-- company_id: string (nullable = true)
 |-- min_dd: date (nullable = true)
 |-- max_dd: date (nullable = true)
 |-- mean: double (nullable = true)
 |-- count: long (nullable = false)

I am trying to do a join using Spark SQL as below:

 val resultDf = accumulated_results_df.as("a").join(computed_df.as("c"),
     ($"a.company_id" === $"c.company_id") && ($"c.min_dd" > $"a.max_dd"), "left")

It fails with the following error:

org.apache.spark.sql.AnalysisException: Reference 'company_id' is ambiguous, could be: a.company_id, c.company_id.;

What am I doing wrong here, and how can I fix it?


Solution

  • It should work if you use the col function to reference the aliased DataFrames and their columns explicitly:

    import org.apache.spark.sql.functions.col

    val resultDf = accumulated_results_df.as("a")
        .join(
            computed_df.as("c"),
            (col("a.company_id") === col("c.company_id")) && (col("c.min_dd") > col("a.max_dd")),
            "left"
        )
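
  • For completeness, here is a minimal, self-contained sketch of the same fix. The session setup, app name, and sample rows are illustrative assumptions, not taken from the question; only the column names and types mirror the schemas above:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().master("local[*]").appName("join-demo").getOrCreate()
    import spark.implicits._

    // Toy data mirroring the schemas in the question (values are made up).
    val accumulated_results_df = Seq(
        ("c1", "2020-01-01", "2019-01-01", "10", "1.5")
    ).toDF("company_id", "max_dd", "min_dd", "count", "mean")

    val computed_df = Seq(
        ("c1", java.sql.Date.valueOf("2020-06-01"), java.sql.Date.valueOf("2020-12-01"), 2.5, 20L)
    ).toDF("company_id", "min_dd", "max_dd", "mean", "count")

    // Qualifying each column through its DataFrame alias removes the ambiguity.
    val resultDf = accumulated_results_df.as("a")
        .join(
            computed_df.as("c"),
            (col("a.company_id") === col("c.company_id")) && (col("c.min_dd") > col("a.max_dd")),
            "left"
        )

    resultDf.show()

    Note that the joined result still carries company_id (and the other shared names) from both sides, so keep qualifying downstream references through the aliases, e.g. resultDf.select(col("a.company_id"), col("c.mean")), or rename one side's columns before the join.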