In my application I have two JavaDStreams which contain some data. I am attempting to count the number of rows in each JavaDStream however the result I am receiving in the log isn't a number but rather a completely different object that its outputting to the log. What am I doing wrong here?
Code:
//map score result set to tweets
JavaDStream<Tuple5<Long, String, Float, Float, String>> result =
scoredTweets.map(new ScoreTweetsFunction());
//get extra elements
JavaDStream<Tuple7<Long, String, String, String, String, String, String>> extra_elements =
json.map(new GetExtraElements());
//join elements with score result
System.out.println("Number of Rows in extra elements RDD: " + extra_elements.count());
System.out.println("Number of Rows in result RDD: " + result.count());
Output from Log:
Number of Rows in extra elements RDD: org.apache.spark.streaming.api.java.JavaDStream@73358a55
Number of Rows in result RDD: org.apache.spark.streaming.api.java.JavaDStream@242aa3b2
DStream
is not a RDD
but a continuous and potentially infinite sequence of RDDs. Because of that it cannot be counted and it is not how count
method is intended to work.
Instead it transforms existing stream into another stream where each RDD
has a single element generated by counting each RDD of this DStream
If you want to perform some action on individual RDDs you should use foreachRDD
.