Tags: java, scala, apache-spark, amazon-emr

Spark SQL fails because "Constant pool has grown past JVM limit of 0xFFFF"


I am running this code on EMR 4.6.0 + Spark 1.6.1:

import org.apache.spark.sql.SQLContext

val sqlContext = SQLContext.getOrCreate(sc)
val inputRDD = sqlContext.read.json(input)

try {
    inputRDD.filter("`first_field` is not null OR `second_field` is not null").toJSON.coalesce(10).saveAsTextFile(output)
    logger.info("DONE!")
} catch {
    case e: Throwable => logger.error("ERROR: " + e.getMessage)
}

In the last stage of saveAsTextFile, it fails with this error:

16/07/15 08:27:45 ERROR codegen.GenerateUnsafeProjection: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool has grown past JVM limit of 0xFFFF
/* 001 */
/* 002 */ public java.lang.Object generate(org.apache.spark.sql.catalyst.expressions.Expression[] exprs) {
/* 003 */   return new SpecificUnsafeProjection(exprs);
/* 004 */ }
(...)

What could be the reason? Thanks.


Solution

  • Solved this problem by dropping all the unused columns in the DataFrame, or by selecting only the columns the query actually needs (see the sketch below).

    It turns out Spark DataFrames cannot handle super-wide schemas. There is no specific number of columns at which Spark breaks with "Constant pool has grown past JVM limit of 0xFFFF" - it depends on the kind of query - but reducing the number of columns helps to work around the issue.

    The underlying root cause is the JVM's limit of 0xFFFF (65,535) constant pool entries per generated Java class - see also Andrew's answer.
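
    For illustration, a minimal sketch of that workaround, reusing `sc`, `input`, and `output` from the question. The two-column projection is an assumption based on the fields the question's filter touches; select whichever columns your query actually uses:

    import org.apache.spark.sql.SQLContext

    val sqlContext = SQLContext.getOrCreate(sc)

    // Read the full JSON input, then keep only the columns the query uses.
    // A narrow projection keeps the generated class's constant pool small.
    val narrowDF = sqlContext.read.json(input)
      .select("first_field", "second_field")

    narrowDF
      .filter("`first_field` is not null OR `second_field` is not null")
      .toJSON
      .coalesce(10)
      .saveAsTextFile(output)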