Tags: apache-spark, google-cloud-platform, google-cloud-dataproc

Spark job out of RAM (java.lang.OutOfMemoryError), even though there's plenty. xmx too low?


I'm getting java.lang.OutOfMemoryError with my Spark job, even though only 20% of the total memory is in use.

I've tried several configurations:

  • 1x n1-highmem-16 + 2x n1-highmem-8
  • 3x n1-highmem-8

My dataset consists of 1.8M records, read from a local JSON file on the master node. The entire dataset in JSON format is 7GB. The job I'm trying to execute involves a simple computation followed by a reduceByKey. Nothing extraordinary. The job runs fine on my single home computer with only 32 GB of RAM (-Xmx28g), although it requires some caching to disk.

The job is submitted with spark-submit, run locally on the server over SSH.

Stack trace and Spark config can be viewed here: https://pastee.org/sgda

The code

val rdd = sc.parallelize(Json.load()) // load the entire dataset into the driver
  .map(fooTransform)                  // apply some trivial transformation
  .flatMap(_.bar.toSeq)               // flatten results
  .map(c => (c, 1))                   // pair each element with a count of 1
  .reduceByKey(_ + _)                 // sum the counts per key
  .sortBy(_._2)                       // sort by count
log.v(rdd.collect.map(_.toString).mkString("\n"))

Solution

  • The root of the problem is that you should offload more I/O to the distributed tasks instead of shipping data back and forth between the driver program and the worker tasks. While it is not always obvious which calls run driver-locally and which describe a distributed action, a good rule of thumb is to avoid parallelize and collect unless you absolutely need all of the data in one place. The amount of data you can Json.load() and then parallelize is capped by the largest available machine type, whereas calls like sc.textFile theoretically scale to hundreds of terabytes or even petabytes without a problem.

    The short-term fix in your case would be to try passing spark-submit --conf spark.driver.memory=40g ... or something in that range (there's a small sanity-check sketch at the end of this answer). Dataproc defaults allocate less than a quarter of the machine to driver memory because the cluster commonly has to support running multiple concurrent jobs, and it also needs to leave enough memory on the master node for the HDFS NameNode and the YARN ResourceManager.

    Longer term, you might want to experiment with loading the JSON data as an RDD directly, instead of loading it in a single driver and using parallelize to distribute it. Having the tasks read the input in parallel dramatically speeds up ingestion, and it also gets rid of the warning Stage 0 contains a task of very large size, which is likely related to shipping large data from your driver to the worker tasks (see the sketch at the end of this answer).

    Similarly, instead of collect-ing and then finishing things up on the driver program, you can use calls like saveAsTextFile on the RDD to write the results in a distributed manner, without ever bottlenecking through a single place (again, see the sketch below).

    Reading the input with sc.textFile assumes line-separated JSON, which you can parse inside a map task; alternatively, you can try sqlContext.read.json (sketched below). For debugging purposes, it's often enough to call take(10) instead of collect() to take a peek at some records without shipping all of them to the driver.
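
    As a quick sanity check for the driver-memory setting, you can print the value the running application actually received. This is just a sketch using the standard SparkConf API, not something taken from your job:

      // Prints e.g. "40g" if --conf spark.driver.memory was picked up at submit time,
      // otherwise falls back to a note that the cluster default applies.
      println(sc.getConf.getOption("spark.driver.memory").getOrElse("not set (cluster default)"))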
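
    To make the distributed version concrete, here is a rough sketch of the same job reading and writing in parallel. The gs:// paths and the parseRecord helper are placeholders (parseRecord stands in for whatever per-record parsing Json.load() currently does); fooTransform and bar are taken from your code:

      // Each task reads a split of the input directly, so nothing is funneled
      // through the driver and the "task of very large size" warning goes away.
      val counts = sc.textFile("gs://your-bucket/input/*.json")  // placeholder path
        .map(parseRecord)               // String => your record type (placeholder helper)
        .map(fooTransform)              // same trivial transformation as before
        .flatMap(_.bar.toSeq)           // flatten results
        .map(c => (c, 1))               // pair each element with a count of 1
        .reduceByKey(_ + _)             // sum counts per key across the workers
        .sortBy(_._2)                   // sort by count

      // Write the result in a distributed manner instead of collect-ing it.
      counts.map { case (k, n) => s"$k\t$n" }
        .saveAsTextFile("gs://your-bucket/output/counts")        // placeholder path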
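
    Alternatively, if the file is newline-delimited JSON, Spark SQL can parse it and infer the schema for you, and take(10) gives you a quick peek for debugging. Again, the path is a placeholder:

      // Let Spark SQL parse newline-delimited JSON and infer the schema.
      val df = sqlContext.read.json("gs://your-bucket/input/*.json")  // placeholder path
      df.printSchema()

      // Sample a handful of rows instead of collect()-ing the whole dataset.
      df.take(10).foreach(println)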