apache-spark spark-streaming

How many RDDs does a DStream generate for a batch interval?


Does one batch interval of data produce exactly one RDD in a DStream, regardless of how much data arrives?


Solution

  • Yes. A DStream produces exactly one RDD per batch interval, regardless of how many records arrive during that interval. The RDD may even contain zero records.

    If that were not the case, and RDD creation depended on the number of elements, you would no longer have synchronous (micro-batch) streaming but a form of asynchronous processing.
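
The clock-driven behaviour described above can be illustrated with a small toy model. This is plain Python, not Spark code: the function `micro_batches`, its parameters, and the sample events are all hypothetical and only mimic the idea that a batch (the analogue of an RDD) is emitted once per interval, whether or not any records arrived.

```python
from collections import defaultdict


def micro_batches(events, batch_interval, num_intervals):
    """Group (timestamp, value) events into one batch per interval.

    Toy stand-in for a DStream (an assumption for illustration, not
    Spark's implementation): every interval yields exactly one batch,
    even when no events arrived in it.
    """
    buckets = defaultdict(list)
    for timestamp, value in events:
        # Map each event to the index of the interval it falls into.
        buckets[int(timestamp // batch_interval)].append(value)
    # Emit on the clock, not on the data: one batch per interval index.
    return [buckets.get(i, []) for i in range(num_intervals)]


# Hypothetical input: (arrival time in seconds, record).
# Nothing arrives during the second interval, [2, 4).
events = [(0.1, "a"), (0.5, "b"), (1.9, "c"), (4.2, "d")]
batches = micro_batches(events, batch_interval=2, num_intervals=3)

for i, batch in enumerate(batches):
    print(f"interval {i}: {len(batch)} record(s) -> {batch}")
```

The second interval still produces a batch, just an empty one. In a real Spark Streaming job the same thing can be observed with `foreachRDD`, which is invoked on the RDD generated for each batch interval.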