The Spark Streaming documentation defines DStream as follows:
Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset.
My question: if a DStream is represented as a series of RDDs, can we build a stream of RDDs ourselves and expect it to work like a DStream?
It would be great if someone could help me understand this with a code sample.
You're right. A DStream is logically a series of RDDs.
Spark Streaming exists to hide the process of creating that Seq[RDD], so building it is the framework's job, not yours.
Moreover, Spark Streaming gives you a much nicer developer API. You can think of a DStream as a Seq[RDD], but rather than writing rdds.map(rdd => your code goes here) you simply write dstream.map(t => your code goes here). The two are not that different except for the types of rdd and t: rdd is a whole RDD, while t is a single element inside one. When you work with a DStream, you are already one level below the RDD.
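Since a code sample was requested, here is a minimal sketch of the comparison. It assumes Spark Streaming is on the classpath and runs in local mode; the object name DStreamVsRdds, the helper double, the 1-second batch interval and the 5-second timeout are all illustrative choices, not anything from the question. It uses StreamingContext.queueStream, which turns a queue of RDDs into a DStream and is mainly meant for testing, so treat it as a demonstration rather than a production pattern.

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamVsRdds {
  // Hypothetical per-element function used on both sides of the comparison.
  def double(x: Int): Int = x * 2

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads and a 1-second batch interval.
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamVsRdds")
    val ssc = new StreamingContext(conf, Seconds(1))
    val sc = ssc.sparkContext

    // A hand-made "series of RDDs".
    val rdds: Seq[RDD[Int]] = Seq(sc.parallelize(1 to 3), sc.parallelize(4 to 6))

    // Plain Seq[RDD]: the outer map hands you a whole RDD,
    // so you need a second, inner map to reach the elements.
    val doubledRdds: Seq[RDD[Int]] = rdds.map(rdd => rdd.map(double))
    doubledRdds.foreach(rdd => println("seq:     " + rdd.collect().mkString(", ")))

    // The same RDDs fed through a DStream: map already works on elements.
    val queue = mutable.Queue(rdds: _*)
    val dstream = ssc.queueStream(queue) // by default, one queued RDD per batch
    dstream.map(double).foreachRDD { rdd =>
      // Skip the empty batches produced once the queue is drained.
      if (!rdd.isEmpty()) println("dstream: " + rdd.collect().mkString(", "))
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // run a few batches, then shut down
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

The Seq[RDD] version works, but you have to create, schedule and iterate over the RDDs yourself. With the DStream, Spark Streaming produces one RDD per batch interval for you, and foreachRDD or transform still give you access to the underlying RDD when you need it.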