scala · apache-spark · mapreduce · rdd · flatmap

Efficiency of flatMap vs map followed by reduce in Spark


I have a text file sherlock.txt containing multiple lines of text. I load it in spark-shell using:

val textFile = sc.textFile("sherlock.txt")

My purpose is to count the number of words in the file. I came across two alternative ways to do the job.

First using flatMap:

textFile.flatMap(line => line.split(" ")).count()

Second using map followed by reduce:

textFile.map(line => line.split(" ").size).reduce((a, b) => a + b)

Both correctly yield the same result. I want to know the differences in time and space complexity between these two implementations, if indeed there are any.

Does the Scala interpreter convert both into the most efficient form?


Solution

  • I will argue that the most idiomatic way to handle this would be to map and sum:

    textFile.map(_.split(" ").size).sum
    

    but at the end of the day the total cost will be dominated by line.split(" ").

    You could probably do a little better by iterating over the string manually and counting runs of whitespace instead of building a new Array, but I doubt it is worth the fuss in general.
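
    For illustration, a minimal sketch of such a manual scan (my own addition, not from the original answer; note that it treats a run of spaces as a single separator, so on lines with repeated spaces it can differ slightly from split(" ").size):

    // Count words in a line by scanning characters, without allocating an Array.
    def countWords(line: String): Int = {
      var words = 0
      var inWord = false
      for (c <- line) {
        val isSpace = c == ' '
        if (!isSpace && !inWord) words += 1 // a new word starts here
        inWord = !isSpace
      }
      words
    }

    // Used the same way as the map-and-sum version above:
    // textFile.map(countWords).sum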

    If you prefer a little deeper insight, count is defined as:

    def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum  
    

    where Utils.getIteratorSize is pretty much a naive iteration over the Iterator that adds up ones (sketched below), and sum is equivalent to

    _.fold(0.0)(_ + _)
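
    For completeness, Utils.getIteratorSize boils down to a counting loop along these lines (a paraphrase of the Spark source rather than a verbatim copy; the exact signature varies across versions):

    // Count the elements of an iterator without materializing them.
    def getIteratorSize[T](iterator: Iterator[T]): Long = {
      var count = 0L
      while (iterator.hasNext) {
        count += 1L   // add one per element
        iterator.next()
      }
      count
    }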