Tags: apache-spark, google-cloud-sql, google-cloud-dataproc

Connecting to Cloud SQL from Dataproc using the Cloud SQL Proxy


I am trying to access Cloud SQL from Dataproc via the Cloud SQL Proxy (without using Hive), using Scala 2.11.12. There are similar questions here on SO, but none of them answer the problem I'm facing.

I've managed to connect Dataproc to Cloud SQL by setting spark.master to "local" mode, but I get an exception when using "yarn" mode, so I'm definitely missing something.

The app crashes when doing:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("SomeSparkJob")
  .getOrCreate()

The exception I get when the job is submitted and it reaches the .getOrCreate() call above:

Exception in thread "main" java.lang.NoSuchFieldError: ASCII
        at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationSubmissionContextPBImpl.checkTags(ApplicationSubmissionContextPBImpl.java:287)
        at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationSubmissionContextPBImpl.setApplicationTags(ApplicationSubmissionContextPBImpl.java:302)
        at org.apache.spark.deploy.yarn.Client$$anonfun$createApplicationSubmissionContext$2.apply(Client.scala:245)
        at org.apache.spark.deploy.yarn.Client$$anonfun$createApplicationSubmissionContext$2.apply(Client.scala:244)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.deploy.yarn.Client.createApplicationSubmissionContext(Client.scala:244)
        at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:180)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:183)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
        at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
        at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
        at dev.ancor.somedataprocsparkjob.SomeSparkJob$.main(SomeSparkJob.scala:13)
        at dev.ancor.somedataprocsparkjob.SomeSparkJob.main(SomeSparkJob.scala)

The question is: why do I get that exception when running in "yarn" mode, and how do I fix it? Thank you!


Solution

  • As Gabe Weiss and David Rabinowitz confirmed, we can place the Dataproc cluster and the Cloud SQL instance in the same VPC network and connect over the instance's private IP; there is no need to use the Cloud SQL Proxy at all. A sketch of the resulting JDBC connection follows below.
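For illustration, a minimal sketch of how the Spark job might read from Cloud SQL over JDBC once both sit on the same VPC network. The IP address, database, table, and credentials are placeholder values, and it assumes the MySQL Connector/J driver is available on the cluster's classpath:

import org.apache.spark.sql.SparkSession

object SomeSparkJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SomeSparkJob")
      .getOrCreate()

    // Placeholder values: substitute the Cloud SQL instance's private IP,
    // database, table, and credentials for your own setup.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://10.0.0.5:3306/somedb")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", "some_table")
      .option("user", "some_user")
      .option("password", "some_password")
      .load()

    df.show()
    spark.stop()
  }
}

One reason this setup is convenient in "yarn" mode: the executors run on the worker nodes, so with the proxy approach you would need a proxy process on every node, whereas a private IP is reachable from all of them with no extra process.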