Tags: java, apache-spark, spark-streaming

Spark Streaming: Rdd.Count() not returning a valid number


In my application I have two JavaDStreams that contain some data. I am attempting to count the number of rows in each JavaDStream, but the result I see in the log isn't a number; instead a completely different object is written out. What am I doing wrong here?

Code:

//map score result set to tweets
JavaDStream<Tuple5<Long, String, Float, Float, String>> result =
        scoredTweets.map(new ScoreTweetsFunction());

//get extra elements
JavaDStream<Tuple7<Long, String, String, String, String, String, String>> extra_elements =
        json.map(new GetExtraElements());

//print the number of rows in each stream
System.out.println("Number of Rows in extra elements RDD: " + extra_elements.count());
System.out.println("Number of Rows in result RDD: " + result.count());

Output from Log:

Number of Rows in extra elements RDD: org.apache.spark.streaming.api.java.JavaDStream@73358a55
Number of Rows in result RDD: org.apache.spark.streaming.api.java.JavaDStream@242aa3b2

Solution

  • A DStream is not an RDD but a continuous, potentially infinite sequence of RDDs. Because of that it cannot simply be counted, and that is not how the count method is intended to work. Instead, count transforms the existing stream into another stream in which each RDD

    has a single element generated by counting each RDD of this DStream

    If you want to perform some action on each individual RDD, use foreachRDD; see the sketch below.
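    In practice that means you count per batch, not once for the whole stream. A minimal sketch, assuming the result and extra_elements streams from the question, a JavaStreamingContext that is started after these lines, and a Spark version whose foreachRDD accepts a VoidFunction (Spark 1.6+):

    //Option 1: count() returns a JavaDStream<Long> with one element per batch;
    //print() writes each batch's value to the driver log
    result.count().print();

    //Option 2: use foreachRDD to get at each batch's RDD and count it yourself
    extra_elements.foreachRDD(rdd ->
            System.out.println("Number of rows in this batch: " + rdd.count()));

    Both calls must be wired up before the streaming context is started; the counts are then printed once per batch interval rather than once for the whole (unbounded) stream.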