apache-spark directed-acyclic-graphs internals

What is ExternalRDDScan in the DAG?


What is the meaning of ExternalRDDScan in the DAG?

I could not find an explanation for it anywhere online.



Solution

  • Based on the Spark source, ExternalRDDScan represents the conversion of an existing RDD of arbitrary objects into a Dataset of InternalRows, i.e. creating a DataFrame from an RDD. Let's verify this in spark-shell by building a DataFrame from an RDD and printing its physical plan:

    scala> import spark.implicits._
    import spark.implicits._
    
    scala> val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26
    
    scala> rdd.toDF().explain()
    == Physical Plan ==
    *(1) SerializeFromObject [input[0, int, false] AS value#2]
    +- Scan ExternalRDDScan[obj#1]
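
    The same check can be done programmatically. The following is a minimal sketch, assuming spark-sql is on the classpath and a local master is acceptable; the object name ExternalRddScanDemo and the helper usesExternalRddScan are hypothetical names introduced for illustration. It compares a DataFrame built from an RDD with one built from a local Seq. Because the text printed by explain() can differ between Spark versions, the sketch matches on the plan node class rather than on the printed name.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.ExternalRDDScanExec

// Hypothetical demo object; not part of Spark itself.
object ExternalRddScanDemo {

  // True if the (unprepared) physical plan contains an ExternalRDDScanExec node.
  def usesExternalRddScan(df: DataFrame): Boolean =
    df.queryExecution.sparkPlan
      .collect { case scan: ExternalRDDScanExec[_] => scan }
      .nonEmpty

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: run locally
      .appName("external-rdd-scan-demo")
      .getOrCreate()
    import spark.implicits._

    // DataFrame backed by an existing RDD of plain JVM objects.
    val fromRdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5)).toDF()

    // DataFrame backed by a local collection, with no RDD of objects involved.
    val fromLocal = Seq(1, 2, 3, 4, 5).toDF()

    fromRdd.explain()
    fromLocal.explain()

    println(s"from RDD uses ExternalRDDScan:   ${usesExternalRddScan(fromRdd)}")
    println(s"from local uses ExternalRDDScan: ${usesExternalRddScan(fromLocal)}")

    spark.stop()
  }
}
```

    Only the RDD-backed DataFrame should report the node, which fits the reading above: the scan appears when Spark has to read arbitrary objects out of an existing RDD before SerializeFromObject turns them into rows.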