Search code examples
scalaparallel-processingparallel.foreachscala-2.13

For Scala 2.13, what is the fastest method for updating a LongMap, HashMap, or TrieMap with millions of updates?


Goal

I have a mutable Map[Long, Long] with millions of entries. I need to make many iterations of updates with millions of updates. I would like to do this as fast as possible.

Background

Currently, the fastest method is to use a single threaded mutable.LongMap[Long]. This type is optimized for Long types as the key.

Other map types appear to be slower -- but I may have implemented them incorrectly as I was trying to do the updates concurrently and/or in parallel without success. It is possible that updating a map in parallel is not actually occurring or is not possible in Scala.

In order of fastest to slowest:

  1. LongMap[Long] (from above)
  2. TrieMap[Long, Long]
  3. ParTrieMap[Long, Long]
  4. HashMap[Long, Long]
  5. ParHashMap[Long, Long]
  6. ParMap[Long, Long]

It is OK if a faster method is not mutable, but I do not think this will be the case. A mutable map is probably best for this use case.

Code to generate test data and time the test

import java.util.Calendar
import scala.collection.mutable

object DictSpeedTest2 {

  //helper constants
  val million: Long = 1000000
  val billion: Long = million * 1000

  //config
  val garbageCollectionWait = 3
  val numEntries: Long = million * 10 //may need to increase JVM memory with something like: -Xmx32g
  val maxValue: Long = billion * million // max Long = 9223372036854775807L
                                         // this is       1000000000000000L

  def main(args: Array[String]): Unit = {
    //generate random data; initial entries in a; updates in b
    val a = genData(numEntries, maxValue, seed = 1000)
    val b = genData(numEntries, maxValue, seed = 9999)

    //initialization
    val dict = new mutable.LongMap[Long]()
    a.foreach(x => dict += (x._1 -> x._2))

    //run and time test
    println("start test: " + Calendar.getInstance().getTime)
    val start = System.currentTimeMillis
    b.foreach(x => dict += (x._1 -> x._2)) //updates
    val end = System.currentTimeMillis

    //print runtime
    val durationInSeconds = (end - start).toFloat / 1000 + "s"
    println("end test:  " + Calendar.getInstance().getTime + " -- " + durationInSeconds)
  }

  def genData(n: Long, max: Long, seed: Long): Array[(Long, Long)] = {
    val r = scala.util.Random
    r.setSeed(seed) //deterministic generation of arrays
    val a = new Array[(Long, Long)](n.toInt)
    a.map(_ => (r.nextInt(), r.nextInt()) )
  }
}

Current timings

LongMap[Long] with the above code completes in the following times on my 2018 MacBook Pro:

  • ~3.5 seconds with numEntries = 10 million
  • ~100 seconds with numEntries = 100 million

Solution

  • If you are not limited to use only Scala/Java maps than for exceptional performance you can peek 3rd party libraries that have maps specialized for Long/Long key/value pairs.

    Here is not so outdated overview of such kind of libraries with benchmark results for Int/Int pairs.