scala · apache-spark · mapreduce · rdd · flatmap

Efficiency of flatMap vs map followed by reduce in Spark


I have a text file sherlock.txt containing multiple lines of text. I load it in spark-shell using:

val textFile = sc.textFile("sherlock.txt")

My purpose is to count the number of words in the file. I came across two alternative ways to do the job.

First using flatMap:

textFile.flatMap(line => line.split(" ")).count()

Second using map followed by reduce:

textFile.map(line => line.split(" ").size).reduce((a, b) => a + b)

Both correctly yield the same result. I want to know the differences in time and space complexity between these two implementations, if indeed there are any.

Does the Scala interpreter convert both into the most efficient form?


Solution

  • I will argue that the most idiomatic way to handle this would be to map and sum:

    textFile.map(_.split(" ").size).sum
    

    but at the end of the day the total cost will be dominated by line.split(" ").

    You could probably do a little better by iterating over the string manually and counting runs of whitespace instead of building a new Array, but I doubt it is worth the fuss in general.
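
    For illustration, a minimal sketch of such a manual scan (my own addition, not from the original answer; note that it treats a run of spaces as a single separator, so on lines with repeated spaces it can differ slightly from split(" ").size):

    // Count words in a line by scanning characters, without allocating an Array.
    def countWords(line: String): Int = {
      var words = 0
      var inWord = false
      for (c <- line) {
        val isSpace = c == ' '
        if (!isSpace && !inWord) words += 1 // a new word starts here
        inWord = !isSpace
      }
      words
    }

    // Used the same way as the map-and-sum version above:
    // textFile.map(countWords).sum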

    If you prefer a little deeper insight, count is defined as:

    def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum  
    

    where Utils.getIteratorSize is pretty much a naive iteration over the Iterator that adds up ones (sketched below), and sum is equivalent to

    _.fold(0.0)(_ + _)
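
    For completeness, Utils.getIteratorSize boils down to a counting loop along these lines (a paraphrase of the Spark source rather than a verbatim copy; the exact signature varies across versions):

    // Count the elements of an iterator without materializing them.
    def getIteratorSize[T](iterator: Iterator[T]): Long = {
      var count = 0L
      while (iterator.hasNext) {
        count += 1L   // add one per element
        iterator.next()
      }
      count
    }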