apache-spark, apache-kafka, spark-streaming, apache-spark-sql, apache-spark-1.4

How to load history data when starting Spark Streaming process, and calculate running aggregations


I have some sales-related JSON data in my ElasticSearch cluster, and I would like to use Spark Streaming (on Spark 1.4.1) to dynamically aggregate incoming sales events from my eCommerce website via Kafka, so that I have a current view of each user's total sales (in terms of revenue and products).

What's not really clear to me from the docs I've read is how to load the history data from ElasticSearch when the Spark application starts, and how to calculate, for example, the overall revenue per user (based on the history plus the incoming sales from Kafka).

I have the following (working) code to connect to my Kafka instance and receive the JSON documents:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext

object ReadFromKafka {
  def main(args: Array[String]) {

    val checkpointDirectory = "/tmp"
    val conf = new SparkConf().setAppName("Read Kafka JSONs").setMaster("local[2]")
    val topicsSet = Array("tracking").toSet

    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint(checkpointDirectory)

    // Create direct kafka stream with brokers and topics
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    //Process each micro-batch
    messages.foreachRDD { rdd =>

      //Only process non-empty batches (isEmpty is cheaper than a full count)
      if (!rdd.isEmpty()) {

        //Create SQLContext and parse the JSON message values
        val sqlContext = new SQLContext(sc)
        val trackingEvents = sqlContext.read.json(rdd.values)

        //Sample aggregation of incoming data
        trackingEvents.groupBy("type").count().show()

      }

    }

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}

I know that there's a Spark connector for ElasticSearch (https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html#spark-read), but it's not really clear to me how to integrate a one-time read at startup with the streaming calculation, so that the history data is aggregated together with the streaming data.

Help is much appreciated! Thanks in advance.


Solution

  • RDDs are immutable, so once an RDD is created you cannot add data to it, for example by updating the revenue with new events.

    What you can do is union the existing data with the new events to create a new RDD, which you can then use as the current total. For example...

    var currentTotal: RDD[(Key, Value)] = ... //read from ElasticSearch
    messages.foreachRDD { rdd =>
        //assumes rdd has already been mapped to the same (Key, Value) shape
        currentTotal = currentTotal.union(rdd)
    }
    

    In this case we make currentTotal a var, since its reference will be replaced by the new RDD each time it is unioned with the incoming data.
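
    For the elided ElasticSearch read, a minimal sketch using the elasticsearch-spark connector might look like the one below. The sales/event index and the userId/revenue field names are illustrative assumptions, not something from the question, and es.nodes is assumed to be set in the SparkConf:

    import org.apache.spark.rdd.RDD
    import org.elasticsearch.spark._

    //esRDD returns an RDD[(documentId, Map[fieldName, value])]
    var currentTotal: RDD[(String, Double)] = sc.esRDD("sales/event")
      .map { case (_, doc) =>
        (doc("userId").toString, doc("revenue").toString.toDouble)
      }
      .reduceByKey(_ + _) //historical revenue per user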

    After the union you may want to perform further operations, such as reducing the values which belong to the same key, but you get the picture.
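
    Putting that together with the Kafka stream from the question, a rough sketch of a running per-user revenue total (again using the assumed userId and revenue fields) could be:

    messages.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val sqlContext = new SQLContext(rdd.sparkContext)

        //Map each incoming JSON event to the same (userId, revenue) shape
        val batch = sqlContext.read.json(rdd.values)
          .select("userId", "revenue").rdd
          .map(row => (row.getString(0), row.getDouble(1)))

        //Union with the running total and collapse duplicates per key
        currentTotal = currentTotal.union(batch).reduceByKey(_ + _)
      }
    }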

    If you use this technique, note that the lineage of your RDDs will grow, since each newly created RDD references its parent. Over many batches this can cause a stack-overflow-style lineage problem. To avoid it, call checkpoint() on the RDD periodically; checkpointing persists the RDD and truncates its lineage.
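
    A sketch of what periodic checkpointing could look like (the batch counter and the interval of 10 batches are arbitrary illustrative choices):

    sc.setCheckpointDir(checkpointDirectory) //required before checkpoint() is used

    var batchCount = 0L
    messages.foreachRDD { rdd =>
      //...union and reduce as above...
      batchCount += 1
      if (batchCount % 10 == 0) {
        currentTotal.checkpoint() //truncate the lineage
        currentTotal.count()      //an action is needed to materialize the checkpoint
      }
    }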