Tags: apache-spark, pyspark, elasticsearch-hadoop

Spark Web UI "take at SerDeUtil.scala:201" interpretation


I am creating a Spark RDD by loading data from Elasticsearch with the elasticsearch-hadoop connector in Python (via pyspark):

# Connection settings for the Elasticsearch cluster (values elided).
es_cluster_read_conf = {
    "es.nodes" : "XXX",
    "es.port" : "XXX",
    "es.resource" : "XXX"
}

# Read the index as an RDD of (NullWritable, LinkedMapWritable) pairs.
es_cluster_rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_cluster_read_conf)

Now, if I only have these two commands in my file and run it, on the Spark Web UI under Application Details I see one job listed as: take at SerDeUtil.scala:201

I have 2 questions now:

1) I was under the impression that Spark RDDs are computed lazily, i.e., if no action is applied, no job would be launched. In the above scenario I am not applying any action, yet I see a job running on the Web UI.

2) If this is a job, what does this "take" operation actually mean? Does it mean that the data is actually loaded from my Elasticsearch node and passed to a Spark node? I understand jobs being listed as collect, count, etc., because these are valid actions in Spark, but even after extensive research I still couldn't figure out the semantics of this take operation.


Solution

  • I was under the impression that Spark RDDs are computed lazily, i.e., if no action is applied, no job would be launched.

    This is more or less true, although there are a few exceptions where a job can be triggered by a secondary task, such as creating a partitioner or converting data between the JVM and a guest language. It is even more complicated when you work with the high-level Dataset API and DataFrames. As an illustration of the first case, see the sketch below.
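
    For example, here is a minimal sketch (local mode, made-up data) of a job being triggered by a transformation alone: sortBy has to sample its input to compute range-partition boundaries, so a sampling job appears in the Web UI even though no action is ever called on the result.

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "lazy-exceptions-demo")

    rdd = sc.parallelize(range(1000), 4)

    # sortBy is only a transformation, but building its range partitioner
    # requires sampling the input, so a job shows up in the Web UI here
    # even though no action has been called on sorted_rdd.
    sorted_rdd = rdd.sortBy(lambda x: x)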

    If this is a job, what does this "take" operation actually mean? Does this mean that the data is actually loaded from my Elasticsearch node and passed to a Spark node?

    It is a job, and some amount of data is actually fetched from the source. It is required to determine the serializer for the key-value pairs: PySpark takes the first record of the Hadoop RDD to check whether the keys and values can be serialized for Python directly or need conversion first, and that first() is implemented with take(1), which is the job you see in the UI.
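
    To see that this probe is small and the bulk load stays lazy, you can extend the snippet from the question (a sketch reusing the es_cluster_rdd defined above): further transformations add no new jobs, and only an explicit action launches a scan of the index.

    # No job is launched here -- map is a lazy transformation.
    docs = es_cluster_rdd.map(lambda kv: kv[1])

    # Only this action launches a job that actually scans the index; it
    # shows up in the Web UI as "count at ..." alongside the earlier
    # "take at SerDeUtil.scala:201" probe.
    print(docs.count())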