scala · apache-spark

Why does Spark fail with java.lang.OutOfMemoryError: GC overhead limit exceeded?


I'm trying to reimplement in Spark a Hadoop MapReduce job that worked fine before. The Spark app definition is the following:

val data = spark.textFile(file, 2).cache()
val result = data
  .map(/* some pre-processing */)
  .map(docWeightPar => (docWeightPar(0), docWeightPar(1)))
  .flatMap(line => MyFunctions.combine(line))
  .reduceByKey(_ + _)

Where MyFunctions.combine is

def combine(tuples: Array[(String, String)]): IndexedSeq[(String,Double)] =
  for (i <- 0 to tuples.length - 2;
       j <- 1 to tuples.length - 1
  ) yield (toKey(tuples(i)._1,tuples(j)._1),tuples(i)._2.toDouble * tuples(j)._2.toDouble)

The combine function produces a large number of map keys when the input list is big, and this is where the exception is thrown: with these loop bounds, an array of n tuples yields on the order of n² pairs, so even a few thousand input tuples produce millions of output pairs.

In the Hadoop MapReduce setting this wasn't a problem, because the point where the combine function yields its pairs is the point where Hadoop wrote the map pairs to disk. Spark seems to keep everything in memory until it explodes with a java.lang.OutOfMemoryError: GC overhead limit exceeded.

I am probably doing something really basic wrong, but I couldn't find any pointers on how to move forward from this, so I would like to know how I can avoid it. Since I am a total noob at Scala and Spark, I am not sure whether the problem comes from one, the other, or both. I am currently trying to run this program on my own laptop, and it works for inputs where the tuples array is not very long.


Solution

  • Adjusting the memory is probably a good way to go, as has already been suggested, because this is an expensive operation that scales quadratically with the input size (a configuration sketch is given at the end of this answer). But maybe some code changes will help.

    You could take a different approach in your combine function that replaces the index-based nested loop with the combinations method. I'd also convert the second element of each tuple to a Double once, before the combination step:

    tuples
      // Convert the String weights to Double only once
      .map { case (key, weight) => (key, weight.toDouble) }
      // Take all pairwise combinations. Though this method
      // will not give self-pairs, which it looks like you might need
      .combinations(2)
      // Your operation
      .map { pair => (toKey(pair(0)._1, pair(1)._1), pair(0)._2 * pair(1)._2) }
    

    This will give an Iterator, which you can use downstream or, if you want, convert to a list (or something) with toList. A self-contained version of the approach is sketched below.
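
    For concreteness, here is a minimal, self-contained sketch of that approach, assuming a placeholder toKey that just concatenates the two keys (your real toKey is not shown in the question):

    object MyFunctions {

      // Hypothetical stand-in for the question's toKey; substitute your own.
      def toKey(a: String, b: String): String = s"$a|$b"

      def combine(tuples: Array[(String, String)]): Iterator[(String, Double)] =
        tuples
          // Convert the String weights to Double only once
          .map { case (key, weight) => (key, weight.toDouble) }
          // All pairwise combinations, without index bookkeeping (no self-pairs)
          .combinations(2)
          // Build the output key and multiply the two weights
          .map { pair => (toKey(pair(0)._1, pair(1)._1), pair(0)._2 * pair(1)._2) }
    }

    For example, combine(Array(("a", "2"), ("b", "3"), ("c", "4"))).toList yields List((a|b,6.0), (a|c,8.0), (b|c,12.0)).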
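
    As for adjusting the memory mentioned at the top of this answer, here is a minimal configuration sketch, assuming the classic SparkConf/SparkContext API that the question uses; the 4g values are placeholders to tune for your machine:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("CombineJob") // hypothetical application name
      // Heap available to each executor on a cluster; example value only
      .set("spark.executor.memory", "4g")

    val sc = new SparkContext(conf)

    Note that when you run locally, the driver JVM does the work, so its heap has to be set at launch rather than in code, e.g. spark-submit --driver-memory 4g ... (again an example value).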