Tags: java, apache-spark, google-cloud-dataproc

Apache Spark job runs locally but throws NullPointerException on Google Cloud Dataproc cluster


I have an Apache Spark application that, until now, I have been running and testing on my local machine with the command:

spark-submit --class "main.SomeMainClass" --master local[4] jarfile.jar

Everything runs fine. However, when I submit this very same job to Google Cloud Dataproc, it throws a NullPointerException:

Caused by: java.lang.NullPointerException
at geneticClasses.FitnessCalculator.calculateFitness(FitnessCalculator.java:30)
at geneticClasses.StringIndividualMapReduce.calculateFitness(StringIndividualMapReduce.java:91)
at mapreduce.Mapper.lambda$mapCalculateFitness$3d84c37$1(Mapper.java:30)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1018)
...

This error is thrown from a worker node, as it occurs during the map phase. What is the difference between local mode and a real cluster, other than that local mode simulates worker nodes as separate threads? FitnessCalculator sits on the driver node and all of its methods are static. Do I need to make it Serializable so it can be shipped to the worker nodes together with the rest of the code?

Thank you


Solution

  • You say that FitnessCalculator only has static methods and that it works in local mode. My guess is that you have some static object (initialized to null) that you set in the driver and then attempt to use within a Spark task at FitnessCalculator.java:30. Unfortunately, that won't work (a hypothetical sketch of this kind of pattern follows below).

    Changes to static fields aren't distributed to Spark workers. The reason it works in local mode is that the workers are running within the same JVM (Java Virtual Machine) as the driver, so they coincidentally have access to the new value.
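
    To make this concrete, here is a hypothetical sketch of the kind of pattern I mean. Your actual FitnessCalculator is not shown in the question, so the field and method names below are only illustrative:

        // Hypothetical reconstruction of the problematic pattern; names are illustrative.
        public final class FitnessCalculator {

            // Static state that only the driver JVM ever sets.
            private static String targetSolution = null;

            public static void setTargetSolution(String solution) {
                targetSolution = solution; // executed on the driver only
            }

            public static int calculateFitness(String candidate) {
                int fitness = 0;
                // In an executor JVM on the cluster, targetSolution is still null here,
                // which produces the NullPointerException seen in the stack trace.
                for (int i = 0; i < candidate.length(); i++) {
                    if (candidate.charAt(i) == targetSolution.charAt(i)) {
                        fitness++;
                    }
                }
                return fitness;
            }
        }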
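
    The usual fix is to pass the value to the workers explicitly rather than through a static field, for example with a broadcast variable. The following is a minimal sketch under the assumption that the missing value is the target string used by calculateFitness; the class and variable names are mine, not from your code:

        import java.util.Arrays;

        import org.apache.spark.SparkConf;
        import org.apache.spark.api.java.JavaRDD;
        import org.apache.spark.api.java.JavaSparkContext;
        import org.apache.spark.broadcast.Broadcast;

        public class FitnessJob {
            public static void main(String[] args) {
                JavaSparkContext sc =
                        new JavaSparkContext(new SparkConf().setAppName("fitness-example"));

                // Ship the value to every executor explicitly instead of relying on a
                // static field that only the driver JVM ever sets.
                Broadcast<String> target = sc.broadcast("target string");

                JavaRDD<String> population =
                        sc.parallelize(Arrays.asList("tarxet strong", "sarget string"));

                // The broadcast handle is serialized into the task closure; calling
                // value() on a worker returns the broadcast content instead of null.
                JavaRDD<Integer> fitness = population.map(candidate -> {
                    String solution = target.value();
                    int score = 0;
                    for (int i = 0; i < Math.min(candidate.length(), solution.length()); i++) {
                        if (candidate.charAt(i) == solution.charAt(i)) {
                            score++;
                        }
                    }
                    return score;
                });

                fitness.collect().forEach(System.out::println);
                sc.stop();
            }
        }

    For a small value like a single string, simply capturing it as a local variable inside the lambda (or passing it as a constructor argument of a serializable helper) works just as well; a broadcast variable is mainly worthwhile when the value is large or reused across many tasks.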