What is the meaning of ExternalRDDScan in the DAG?
I could not find an explanation for it anywhere online.
Based on the source, ExternalRDDScan represents the conversion of an existing RDD of arbitrary objects into a dataset of InternalRows, i.e. creating a DataFrame. Let's verify that our understanding is correct:
scala> import spark.implicits._
import spark.implicits._
scala> val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26
scala> rdd.toDF().explain()
== Physical Plan ==
*(1) SerializeFromObject [input[0, int, false] AS value#2]
+- Scan ExternalRDDScan[obj#1]
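
To cross-check this, here is a minimal sketch that compares the leaf operator of the planned physical plan for three ways of building the same DataFrame. The helper name `leafNodes`, the app name and the local master setting are my own assumptions for illustration, not anything from the Spark docs. It inspects operator class names rather than the `explain()` text, since the printed label may differ between Spark versions.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Assumption: a local session; in spark-shell getOrCreate() reuses the existing one.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("external-rdd-scan-check") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Hypothetical helper: class names of the leaf operators in the physical plan
// (taken before preparation rules such as whole-stage codegen are applied).
def leafNodes(df: DataFrame): Seq[String] =
  df.queryExecution.sparkPlan.collectLeaves().map(_.getClass.getSimpleName)

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))
val schema = StructType(Seq(StructField("value", IntegerType, nullable = false)))

// 1. RDD of arbitrary JVM objects (here Int) converted through an encoder.
val fromObjects = leafNodes(rdd.toDF())

// 2. RDD of Row objects with an explicit schema.
val fromRows = leafNodes(spark.createDataFrame(rdd.map(Row(_)), schema))

// 3. Local collection, no RDD involved.
val fromLocal = leafNodes(Seq(1, 2, 3, 4, 5).toDF())

println(s"rdd.toDF():            $fromObjects")
println(s"createDataFrame(rows): $fromRows")
println(s"Seq(...).toDF():       $fromLocal")
```

If the reading above is right, only the first case should report ExternalRDDScanExec as its leaf, because it is the only one that starts from an RDD of plain objects that still have to be serialized into InternalRows. As far as I can tell, the Row-based RDD is planned as RDDScanExec and the local collection as LocalTableScanExec.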