java apache-spark spark-streaming

How does Spark Streaming set up the computation?


I have a basic question about how Spark Streaming works.

The page linked here contains a block which says: "Note that when these lines are executed, Spark Streaming only sets up the computation it will perform after it is started, and no real processing has started yet. To start the processing after all the transformations have been setup, we finally call start method."

I am not able to digest that description. I would expect the computation to live in a function, and for "ssc.start()" to somehow execute that function in a loop. But here the core computation is executed first, and only then is the streaming context started. How does this make sense?

import java.util.Arrays;
import java.util.regex.Pattern;

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.examples.streaming.StreamingExamples;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public final class JavaNetworkWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) throws Exception {
    if (args.length < 2) {
      System.err.println("Usage: JavaNetworkWordCount <hostname> <port>");
      System.exit(1);
    }

    StreamingExamples.setStreamingLogLevels();

    // Create the context with a 1 second batch size
    SparkConf sparkConf = new SparkConf().setAppName("JavaNetworkWordCount");
    JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));

    // Create a JavaReceiverInputDStream on the target ip:port and count the
    // words in the input stream of \n-delimited text (e.g. generated by 'nc').
    // Note that the storage level without replication is only suitable when
    // running locally; replication is necessary in a distributed scenario
    // for fault tolerance.

    /* the four statements below represent the core computation */
    JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
            args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);

    JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator());

    JavaPairDStream<String, Integer> wordCounts = words.mapToPair(s -> new Tuple2<>(s, 1))
        .reduceByKey((i1, i2) -> i1 + i2);

    wordCounts.print();

    /* starting actual stream processing */
    ssc.start();
    ssc.awaitTermination();
  }
}

Solution

  • But here the core computation is executed first, and only then is the streaming context started. How does this make sense?

    Because the "core computation" is not a computation. It is a description of what is supposed to be done, if the StreamingContext is ever started and if any data is ever received.

    If it is still not clear: it describes the what, not the how, and the internals are not leaked into the user code. The sketch below illustrates the point.
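
To see the laziness concretely, here is a minimal sketch (the class name, host, and port are made up for illustration) in which the lambda passed to map prints a message. Nothing is printed while the graph is being set up; the messages only appear after ssc.start() kicks off batch processing:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public final class LazySetupDemo {
  public static void main(String[] args) throws Exception {
    // local[2]: one core for the socket receiver, one for processing
    SparkConf conf = new SparkConf().setAppName("LazySetupDemo").setMaster("local[2]");
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

    JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

    // This call only registers a transformation in the DStream graph;
    // the lambda does not run here.
    JavaDStream<Integer> lengths = lines.map(line -> {
      System.out.println("processing: " + line);  // not printed during setup
      return line.length();
    });
    lengths.print();  // an output operation; also just recorded, not executed

    System.out.println("Graph is set up, nothing has been processed yet.");

    ssc.start();             // only now does the scheduler start running the graph,
    ssc.awaitTermination();  // once per batch interval, on whatever data was received
  }
}

With a text source running (e.g. nc -lk 9999 in another terminal), each line typed after start() produces one "processing: ..." message, which confirms that start() is what drives the loop, not the transformation calls themselves.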