Tags: scala, apache-spark, extension-methods, implicit

How does import spark.sqlContext.implicits._ work in Scala?


I'm new to Scala.

Here's what I'm trying to understand.

This code snippet gives me an RDD[Int] and doesn't offer the option to call toDF:

var input = spark.sparkContext.parallelize(List(1,2,3,4,5,6,7,8,9))

But when I import spark.sqlContext.implicits._, I do get the option to call toDF:

import spark.sqlContext.implicits._
var input = spark.sparkContext.parallelize(List(1,2,3,4,5,6,7,8,9)).toDF

So I looked into the source code: implicits is present in the SQLContext class as an object. What I can't understand is how an RDD instance is able to call toDF after the import.

Can anyone help me understand?

Update

I found the code snippet below in the SQLContext class:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala

  object implicits extends SQLImplicits with Serializable {
    protected override def _sqlContext: SQLContext = self
  }
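
Here self is a self alias that SQLContext declares on itself (class SQLContext ... { self => ... } in the same file), so the nested object can refer to the enclosing instance. A minimal sketch of that pattern with illustrative names (Outer/inner are not from Spark):

  class Outer { self => // `self` is an alias for this Outer instance
    object inner {
      // Inside `inner`, `this` refers to `inner`; `self` still
      // refers to the enclosing Outer instance.
      def outerRef: Outer = self
    }
  }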

Solution

  • toDF is an extension method. With the import you bring the necessary implicits into scope.

    For example, Int doesn't have a method foo:

    1.foo() // doesn't compile
    

    But if you define an extension method and import the implicit conversion:

    object implicits {
      implicit class IntOps(i: Int) {
        def foo() = println("foo")
      }
    }
    
    import implicits._
    1.foo() // compiles
    

    The compiler transforms 1.foo() into new IntOps(1).foo().
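
    You can trigger the same call explicitly to see the desugaring; a minimal sketch reusing the IntOps example above (both lines compile and print "foo"):

    import implicits._
    
    new IntOps(1).foo()       // what the compiler generates for 1.foo()
    implicits.IntOps(1).foo() // equivalent, via the conversion method that
                              // an implicit class also generates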

    Similarly,

    object implicits extends SQLImplicits ...
    
    abstract class SQLImplicits ... {
      ...
    
      implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = {
        DatasetHolder(_sqlContext.createDataset(rdd))
      }
    
      implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = {
        DatasetHolder(_sqlContext.createDataset(s))
      }
    }
    
    case class DatasetHolder[T] private[sql](private val ds: Dataset[T]) {
    
      def toDS(): Dataset[T] = ds
    
      def toDF(): DataFrame = ds.toDF()
    
      def toDF(colNames: String*): DataFrame = ds.toDF(colNames : _*)
    }
    

    With import spark.sqlContext.implicits._ in scope, the compiler transforms spark.sparkContext.parallelize(List(1,2,3,4,5,6,7,8,9)).toDF into rddToDatasetHolder(spark.sparkContext.parallelize...).toDF, i.e. DatasetHolder(_sqlContext.createDataset(spark.sparkContext.parallelize...)).toDF.
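
    As a rough, runnable sketch of that chain written out by hand (assuming a local SparkSession named spark; the same import also makes rddToDatasetHolder and the Encoder[Int] newIntEncoder callable by name):

    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.sqlContext.implicits._
    
    val rdd = spark.sparkContext.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9))
    
    val df1 = rdd.toDF()                                    // implicit form
    val df2 = rddToDatasetHolder(rdd)(newIntEncoder).toDF() // spelled out
    
    df1.show()
    df2.show()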

    You can read more about implicits and extension methods in Scala:

    Understanding implicit in Scala

    Where does Scala look for implicits?

    Understand Scala Implicit classes

    https://docs.scala-lang.org/overviews/core/implicit-classes.html

    https://docs.scala-lang.org/scala3/book/ca-extension-methods.html

    https://docs.scala-lang.org/scala3/reference/contextual/extension-methods.html

    How extend a class is diff from implicit class?


    About spark.implicits._

    Importing spark.implicits._ in scala

    What is imported with spark.implicits._?

    import implicit conversions without instance of SparkSession

    Workaround for importing spark implicits everywhere

    Why is spark.implicits._ is embedded just before converting any rdd to ds and not as regular imports?