apache-spark streaming dstream

Spark Streaming DStream map vs foreachRDD: which is more efficient for transformation?


Just for transformation, map and foreachRDD can achieve the same goal, but which one is more efficient? And why?

For example, for a DStream[Int]:

val newDs1=Ds.map(x=> x+1)
val newDs2=Ds.foreachRDD (rdd=>rdd.map(x=> x+1))

I know foreachRDD operates on the RDD directly, whereas map seems to convert the DStream to an RDD first (I'm not sure), so foreachRDD seems more efficient than map. However, map is a transformation while foreachRDD is an output operation, so map should be more efficient than foreachRDD for transformations. Does anybody know which is right, and why? Thanks for any reply.

Add one more comparison:

val newDS3=Ds.transform (rdd=>rdd.map(x=> x+1))

Which is more efficient for transformation?


Solution

  • You could answer this question yourself by checking the types. foreachRDD returns Unit, so what you have is:

     val newDs2: Unit = Ds.foreachRDD (rdd=>rdd.map(x=> x+1))
    

    Not only do you not get a DStream[_], but the inner map is never executed: RDD transformations are lazy, and no action is ever called on the result.

    The following two:

    Ds.map(x=> x+1)
    Ds.transform (rdd=>rdd.map(x=> x+1))
    

    are identical in terms of execution, so it doesn't make sense to use the latter one, which is unnecessarily verbose.
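
    To see the difference in practice, here is a minimal sketch, assuming a local Spark Streaming setup fed by a test queue. The object name MapVsForeachRDD, the local[2] master, the one-second batch interval, and the sample values are all assumptions made for illustration, not part of the original question.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical demo object; names and settings are illustrative only.
object MapVsForeachRDD {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("MapVsForeachRDD")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Assumed test input: a queue holding a single RDD[Int].
    val queue = mutable.Queue[RDD[Int]](ssc.sparkContext.parallelize(Seq(1, 2, 3)))
    val ds: DStream[Int] = ssc.queueStream(queue)

    // Both of these are transformations and both yield a DStream[Int].
    val viaMap: DStream[Int] = ds.map(x => x + 1)
    val viaTransform: DStream[Int] = ds.transform(rdd => rdd.map(x => x + 1))

    // foreachRDD is an output operation and returns Unit.
    // The inner map below is never computed: no action is called on it.
    val discarded: Unit = ds.foreachRDD { rdd => rdd.map(x => x + 1); () }

    // If foreachRDD is really wanted, an action must run inside it.
    ds.foreachRDD { rdd =>
      println(rdd.map(x => x + 1).collect().mkString(", "))
    }

    // print() is itself an output operation, so these two streams get evaluated.
    viaMap.print()
    viaTransform.print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(3000) // let a few batches run, then shut down
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

    With this setup, the map and transform streams should print the same elements for each batch, while the first foreachRDD call contributes nothing because its mapped RDD is built but never evaluated.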