scala | apache-spark | rdd | word-count

When to use countByValue and when to use map().reduceByKey()


I am new to Spark and Scala and am working on a simple word count example.

For that I am using countByValue as follows:

val words = lines.flatMap(x => x.split("\\W+")).map(x => x.toLowerCase())
val wordCount = words.countByValue();

which works fine.

And the same thing can be achieved like this:

val words = lines.flatMap(x => x.split("\\W+")).map(x => x.toLowerCase())
val wordCounts = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
val sortedWords = wordCounts.map(x => (x._2, x._1)).sortByKey()

which also works fine.

Now, my question is: when should I use which method? Is one preferred over the other?


Solution

  • Consider the example here with numbers rather than words:

    val n = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
    val n2 = n.countByValue
    

    returns a local Map:

    n: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at command-3737881976236428:1
    n2: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 6, 6 -> 1, 2 -> 3, 7 -> 1, 3 -> 1, 8 -> 1, 4 -> 2)
    

    That is the key difference.

    If you want a local Map out of the box, then countByValue is the way to go.

    Also note that with countByValue the reduce is implicit: it cannot be customized, nor does it need to be supplied, unlike with reduceByKey (see the first sketch below).

    reduceByKey is preferred when the data is large: countByValue loads the entire Map into driver memory, whereas the reduceByKey result stays distributed (see the second sketch below).
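
    To make the difference concrete, here is a minimal sketch assuming the same SparkContext sc as above; collectAsMap() is used only to bring the reduceByKey result to the driver for comparison:

    // Sketch assuming the same SparkContext `sc` as in the example above.
    val nums = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1))
    // countByValue: the reduce is built in, and the result is already a local Map.
    val localCounts: scala.collection.Map[Int, Long] = nums.countByValue()
    // reduceByKey: the combining function is supplied explicitly (here addition, but
    // any associative function would do), and the result stays an RDD until it is
    // explicitly brought to the driver, e.g. with collectAsMap().
    val countsRdd: org.apache.spark.rdd.RDD[(Int, Int)] =
      nums.map(x => (x, 1)).reduceByKey(_ + _)
    val alsoLocal: scala.collection.Map[Int, Int] = countsRdd.collectAsMap()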
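
    And a sketch of the large-data case, assuming lines is the RDD[String] from the question (the top-10 step and the output path are only illustrative): with reduceByKey nothing larger than the final, small result ever has to reach the driver.

    // Sketch assuming `lines` is the RDD[String] from the question.
    val wordCounts = lines
      .flatMap(_.split("\\W+"))
      .map(_.toLowerCase)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    // Further work stays on the cluster; only the small top-10 result reaches the driver.
    val top10 = wordCounts
      .map { case (word, count) => (count, word) }
      .sortByKey(ascending = false)
      .take(10)
    // Or write everything out without ever collecting it (the path is illustrative).
    wordCounts.saveAsTextFile("hdfs:///tmp/word-counts")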