apache-spark, google-cloud-dataproc, hocon

Add conf file to classpath in Google Dataproc


We're building a Spark application in Scala with a HOCON configuration; the config file is called application.conf.
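For context, Typesafe Config's ConfigFactory.load() resolves application.conf from the JVM classpath by default, which is why where the file ends up matters. A minimal sketch of such a file (only the key name xyz comes from the error further down; the values are made up):

xyz = "some value"
nested {
  other-key = 42
}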

If I add the application.conf to my jar file and start a job on Google Dataproc, it works correctly:

gcloud dataproc jobs submit spark \
  --cluster <clustername> \
  --jar=gs://<bucketname>/<filename>.jar \
  --region=<myregion> \
  -- \
  <some options>

Instead of bundling application.conf into the jar, I want to provide it separately, but I can't get that working.

I've tried different things (roughly sketched after this list), e.g.:

  1. Specifying the application.conf with --jars=gs://<bucketname>/application.conf (which should work according to this answer)
  2. Using --files=gs://<bucketname>/application.conf
  3. Same as 1. + 2., but with application.conf in /tmp/ on the master instance of the cluster, then specifying the local file with file:///tmp/application.conf
  4. Defining extraClassPath for Spark using --properties=spark.driver.extraClassPath=gs://<bucketname>/application.conf (and the same for executors)
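For reference, this is roughly how those submissions looked (same placeholders as above; these are the failing attempts, not a fix):

# Attempt 2 (attempt 1 is the same with --jars instead of --files)
gcloud dataproc jobs submit spark \
  --cluster <clustername> \
  --jar=gs://<bucketname>/<filename>.jar \
  --files=gs://<bucketname>/application.conf \
  --region=<myregion> \
  -- \
  <some options>

# Attempt 4: pointing extraClassPath at the GCS object
gcloud dataproc jobs submit spark \
  --cluster <clustername> \
  --jar=gs://<bucketname>/<filename>.jar \
  --properties=spark.driver.extraClassPath=gs://<bucketname>/application.conf,spark.executor.extraClassPath=gs://<bucketname>/application.conf \
  --region=<myregion> \
  -- \
  <some options>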

With all of these options I get the same error: it can't find the key in the config:

Exception in thread "main" com.typesafe.config.ConfigException$Missing: system properties: No configuration setting found for key 'xyz'

This error usually means there's a mistake in the HOCON config (key xyz is not defined in it) or that application.conf is not on the classpath. Since the exact same config works when it's inside my jar file, I assume it's the latter.
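One way to double-check that: the bundled build should show the file when listing the jar's contents with the standard JDK jar tool, while the non-bundled build won't:

jar tf <filename>.jar | grep application.conf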

Are there any other options to put the application.conf on the classpath?


Solution

  • If --jars doesn't work as suggested in this answer, you can try an initialization action: first upload your config to GCS, then write an init action that downloads it to the VMs and puts it in a folder that is on the classpath, or update spark-env.sh to include the path to the config (a rough sketch follows).
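A rough sketch of that approach, assuming a hypothetical script name (download-conf.sh) and target folder; check it against the existing spark-defaults.conf on your cluster before relying on it:

#!/bin/bash
# download-conf.sh -- init action, runs on every node when the cluster is created
set -euo pipefail

CONF_URI="gs://<bucketname>/application.conf"   # the config uploaded to GCS
CONF_DIR="/etc/spark/extra-conf"                # any node-local folder works

mkdir -p "${CONF_DIR}"
gsutil cp "${CONF_URI}" "${CONF_DIR}/application.conf"

# Put that folder on the driver/executor classpath. If extraClassPath is
# already set in spark-defaults.conf, append to the existing value instead
# of adding a duplicate key.
cat >> /etc/spark/conf/spark-defaults.conf <<EOF
spark.driver.extraClassPath=${CONF_DIR}
spark.executor.extraClassPath=${CONF_DIR}
EOF

Upload the script to GCS and reference it when creating the cluster:

gcloud dataproc clusters create <clustername> \
  --region=<myregion> \
  --initialization-actions=gs://<bucketname>/download-conf.sh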