Tags: scala, apache-spark, distributed-computing

Difference between rdd.collect().toMap and rdd.collectAsMap()?


Is there any performance impact when I use collectAsMap() on my RDD instead of rdd.collect().toMap?

I have a key-value RDD that I want to convert to a HashMap. As far as I know, collect() is not efficient on large data sets because it brings everything to the driver. Can I use collectAsMap() instead, and is there any performance impact?

Original:

    val QuoteHashMap = QuoteRDD.collect().toMap
    val QuoteRDDData = QuoteHashMap.values.toSeq
    val QuoteRDDSet = sc.parallelize(QuoteRDDData.map(x => x.toString.replace("(", "").replace(")", "")))
    QuoteRDDSet.saveAsTextFile(Quotepath)

Change:

    val QuoteHashMap = QuoteRDD.collectAsMap()
    val QuoteRDDData = QuoteHashMap.values.toSeq
    val QuoteRDDSet = sc.parallelize(QuoteRDDData.map(x => x.toString.replace("(", "").replace(")", "")))
    QuoteRDDSet.saveAsTextFile(Quotepath)
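
For reference, a minimal self-contained version of the collectAsMap variant is sketched below. QuoteRDD is stubbed with toy pairs, and the local master setting and output path are placeholders; only the shape of the pipeline is meant to carry over:

    import org.apache.spark.{SparkConf, SparkContext}

    object QuotePipeline {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("QuotePipeline").setMaster("local[*]"))
        // Stand-in for the real key-value RDD of quotes.
        val QuoteRDD = sc.parallelize(Seq(("q1", ("AAPL", 101.5)), ("q2", ("MSFT", 99.0))))

        val QuoteHashMap = QuoteRDD.collectAsMap()   // pulls all pairs onto the driver
        val QuoteRDDData = QuoteHashMap.values.toSeq // driver-side sequence of values
        val QuoteRDDSet = sc.parallelize(
          QuoteRDDData.map(x => x.toString.replace("(", "").replace(")", "")))
        QuoteRDDSet.saveAsTextFile("/tmp/quotepath") // placeholder output path
        sc.stop()
      }
    }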

Solution

  • The implementation of collectAsMap (from Spark's PairRDDFunctions) is the following:

    def collectAsMap(): Map[K, V] = self.withScope {
      val data = self.collect()
      val map = new mutable.HashMap[K, V]
      map.sizeHint(data.length)
      data.foreach { pair => map.put(pair._1, pair._2) }
      map
    }
    

    Thus, there is no performance difference between collect().toMap and collectAsMap(): collectAsMap also calls collect under the hood, so both bring the entire RDD to the driver. The only difference is the kind of map built afterwards on the driver (a mutable HashMap versus an immutable Map); neither avoids the driver-memory cost on large data sets.
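
    To make this concrete, here is a small sketch (toy data, local mode, illustrative names) showing that both calls materialize the full RDD on the driver and resolve duplicate keys the same way, with the later pair winning:

        import org.apache.spark.{SparkConf, SparkContext}

        object CollectComparison {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(
              new SparkConf().setAppName("CollectComparison").setMaster("local[*]"))
            val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

            // Both routes pull every pair onto the driver first.
            val viaToMap: Map[String, Int] = rdd.collect().toMap
            val viaCollectAsMap: scala.collection.Map[String, Int] = rdd.collectAsMap()

            println(viaToMap)        // a -> 3 and b -> 2: the later ("a", 3) pair wins
            println(viaCollectAsMap) // same contents, backed by a mutable HashMap
            sc.stop()
          }
        }

    Incidentally, if the end goal is only to write the values back out as text, both collects can be avoided entirely by staying on the cluster, e.g. QuoteRDD.values.map(_.toString.replace("(", "").replace(")", "")).saveAsTextFile(path), adding a reduceByKey step first if the duplicate-key collapsing performed by toMap/collectAsMap actually matters.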