Tags: scala, apache-spark, dataframe, hivecontext

column is not a member of org.apache.spark.sql.DataFrame


I am new to Spark, and I am trying to join two Hive tables from Scala code:

import org.apache.spark.sql._
import sqlContext.implicits._

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

val csp = hiveContext.sql("select * from csp")
val ref = hiveContext.sql("select * from ref_file")

val csp_ref_join = csp.join(ref, csp.model_id == ref.imodel_id , "LEFT_OUTER")

However, for the above join I got this error:

<console>:54: error: value model_id is not a member of org.apache.spark.sql.DataFrame
         val csp_ref_join = csp.join(ref, csp.model_id == ref.imodel_id , "LEFT_OUTER")

Is this the right way to join Hive tables? If not, what went wrong?

One more question: for joins on Hive tables, which is the better approach performance-wise, doing the join from Scala or doing the same join in Hive itself? And is doing it in Scala with hiveContext the right way?

Thanks in advance!


Solution

  • Since you are using Scala, you cannot access columns with dot syntax: csp.model_id works in PySpark, but a Scala DataFrame has no such field. Also, column equality is expressed with ===, which builds a Column expression, not with ==. Using the names from your question (a complete runnable sketch follows below):

    csp.join(ref, csp("model_id") === ref("imodel_id"), "leftouter")
    

    or (if there are no column-name conflicts between the two DataFrames; the $ interpolator requires the implicits import, e.g. import hiveContext.implicits._):

    csp.join(ref, $"model_id" === $"imodel_id", "leftouter")
    

    or (under the same conditions as above):

    import org.apache.spark.sql.functions.col
    
    csp.join(ref, col("model_id") === col("imodel_id"), "leftouter")
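
For reference, here is a minimal, self-contained sketch of the corrected join, assuming the Spark 1.x HiveContext API from your question and the table and column names used above (csp, ref_file, model_id, imodel_id):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object CspRefJoin {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("csp-ref-join"))
        val hiveContext = new HiveContext(sc)

        // Load both Hive tables as DataFrames
        val csp = hiveContext.sql("select * from csp")
        val ref = hiveContext.sql("select * from ref_file")

        // Columns are addressed via apply() on the DataFrame;
        // === builds a Column expression for the join condition,
        // whereas == would just compare the two DataFrame references
        val cspRefJoin = csp.join(ref, csp("model_id") === ref("imodel_id"), "leftouter")

        cspRefJoin.show()
      }
    }

As for your performance question: whether you express the join through the DataFrame API or as a single HiveQL statement passed to hiveContext.sql, both are planned by Spark's Catalyst optimizer, so performance should generally be comparable.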