Tags: scala, apache-spark, google-bigquery, spark-streaming, hadoop-yarn

Spark on YARN and spark-bigquery connector


I have developed a Scala Spark application for streaming data directly into Google BigQuery, using the spark-bigquery connector by Spotify.

Locally it works correctly; I have configured my application as described at https://github.com/spotify/spark-bigquery:

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.spotify.spark.bigquery._  // adds the setGcpJsonKeyFile / setBigQuery* methods to SQLContext

val ssc = new StreamingContext(sc, Seconds(120))
val sqlContext = new SQLContext(sc)
sqlContext.setGcpJsonKeyFile("/opt/keyfile.json")
sqlContext.setBigQueryProjectId("projectid")
sqlContext.setBigQueryGcsBucket("gcsbucketname")
sqlContext.setBigQueryDatasetLocation("US")

but when I submit the application to my Spark on YARN cluster, the job fails because it cannot find the GOOGLE_APPLICATION_CREDENTIALS environment variable:

The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials.

I set the variable as an OS environment variable for the root user, pointing to the .json file containing the required credentials, but it still fails.

I have also tried setting it programmatically with the following line

System.setProperty("GOOGLE_APPLICATION_CREDENTIALS", "/opt/keyfile.json")

without success.

Any idea what I'm missing?

Thank you,

Leonardo


Solution

  • The Spark on YARN documentation suggests: "Environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file. Environment variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode." This also explains why the earlier attempts failed: in cluster mode the Application Master runs on whatever node YARN schedules it, its container does not inherit the root user's shell environment, and System.setProperty sets a JVM system property, not a process environment variable, so the credentials library never sees it. A concrete sketch follows below.
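
As a concrete sketch (assuming, as in the question, that the keyfile is present at /opt/keyfile.json on every node of the cluster), the following two lines in conf/spark-defaults.conf expose the variable to the YARN Application Master and, via the analogous spark.executorEnv.[EnvironmentVariableName] property, to the executors:

spark.yarn.appMasterEnv.GOOGLE_APPLICATION_CREDENTIALS /opt/keyfile.json
spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS /opt/keyfile.json

The same settings can also be passed per job on the spark-submit command line, e.g. --conf spark.yarn.appMasterEnv.GOOGLE_APPLICATION_CREDENTIALS=/opt/keyfile.json (plus the executorEnv equivalent), rather than cluster-wide in spark-defaults.conf.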