I am creating a Spark RDD by loading data from Elasticsearch with the elasticsearch-hadoop connector in Python (importing pyspark), as follows:
# Elasticsearch connection settings for the connector
es_cluster_read_conf = {
    "es.nodes" : "XXX",
    "es.port" : "XXX",
    "es.resource" : "XXX"
}

# Read the index as an RDD of (key, document) pairs via the Hadoop InputFormat
es_cluster_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_cluster_read_conf)
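(For context, sc above is just a regular SparkContext; a minimal, hypothetical setup would look like the following, assuming the elasticsearch-hadoop jar is supplied on the classpath, e.g. via spark-submit --jars.)

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("es-read")
sc = SparkContext(conf=conf)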
Now, if I only have these 2 commands in my file and run it, on the Spark web UI's Application Details page I see one job listed as: take at SerDeUtil.scala:201
I have 2 questions now:
1) I was under the impression that RDDs in Spark are computed lazily, i.e., if no action is applied, no job would be launched. In the above scenario I am not applying any action, yet I see a job running on the web UI.
2) If this is a job, what does this "take" operation actually mean? Does it mean that the data is actually loaded from my Elasticsearch node and passed to the Spark node? I understand jobs being listed as collect, count, etc., because these are valid actions in Spark. However, even after doing extensive research, I still couldn't figure out the semantics of this take operation.
I was under the impression that RDDs in Spark are computed lazily, i.e., if no action is applied, no job would be launched.
This is more or less true, although there are a few exceptions: a job can be triggered by a secondary task such as building a partitioner or converting data between the JVM and a guest language like Python. Things get even more involved when you work with the high-level Dataset API and DataFrames.
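To make the first case concrete, here is a minimal sketch (assuming a plain SparkContext named sc): in PySpark, the sortByKey transformation eagerly counts and samples the data to compute range-partitioner boundaries, so jobs show up in the web UI even though no action was called explicitly.

# No action is called here, yet sampling jobs appear in the UI, because
# PySpark's sortByKey computes range-partitioner boundaries up front.
pairs = sc.parallelize([(3, "c"), (1, "a"), (2, "b")])
sorted_pairs = pairs.sortByKey()      # triggers count/sample jobs
print(sorted_pairs.collect())         # the sort itself only runs on this action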
If this is a job, what does this "take" operation actually mean? Does it mean that the data is actually loaded from my Elasticsearch node and passed to the Spark node?
It is a job, and a small amount of data is actually fetched from the source. Spark needs it to determine the serializer for the key-value pairs, which is why the job shows up as take at SerDeUtil.scala.
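Roughly speaking (a sketch of the behaviour, not the actual SerDeUtil code), PySpark peeks at a record or two when newAPIHadoopRDD returns so it can pick a serializer for the JVM key-value pairs before exposing them to Python. The full index is only read when you apply an action yourself, and those reads show up as separate jobs named after the action:

# The implicit "take at SerDeUtil.scala" job only touched a few records;
# explicit actions trigger the real reads and appear as their own jobs.
first_key, first_doc = es_cluster_rdd.first()   # small read; the document arrives as a dict-like value
total_docs = es_cluster_rdd.count()             # full scan of the configured es.resource
print(total_docs, first_doc)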