When creating our Dataproc Spark cluster we pass
--properties spark:spark.jars.packages=mysql:mysql-connector-java:6.0.6
to the gcloud dataproc clusters create command so that our PySpark scripts can save to Cloud SQL.
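For reference, the full cluster-creation command looks roughly like this; the cluster name is a placeholder, and the --properties flag is the relevant part:

```shell
# "my-cluster" is a hypothetical name; only --properties matters here.
gcloud dataproc clusters create my-cluster \
  --properties spark:spark.jars.packages=mysql:mysql-connector-java:6.0.6
```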
Apparently this does nothing at cluster creation time; instead, the first spark-submit tries to resolve the dependency.
It does seem to resolve and download the necessary jar file, but the first job on the cluster fails because of download failures reported by spark-submit:
Exception in thread "main" java.lang.RuntimeException: [download failed: mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1177)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:298)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The complete output is:
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
mysql#mysql-connector-java added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found mysql#mysql-connector-java;6.0.6 in central
downloading https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar ...
:: resolution report :: resolve 527ms :: artifacts dl 214ms
:: modules in use:
mysql#mysql-connector-java;6.0.6 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 1 | 1 | 0 || 1 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
[FAILED ] mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar: Downloaded file size doesn't match expected Content Length for https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar. Please retry. (212ms)
[FAILED ] mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar: Downloaded file size doesn't match expected Content Length for https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar. Please retry. (212ms)
==== central: tried
https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: FAILED DOWNLOADS ::
:: ^ see resolution messages for details ^ ::
::::::::::::::::::::::::::::::::::::::::::::::
:: mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar
::::::::::::::::::::::::::::::::::::::::::::::
However, subsequent jobs on the cluster show this output:
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
mysql#mysql-connector-java added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found mysql#mysql-connector-java;6.0.6 in central
:: resolution report :: resolve 224ms :: artifacts dl 5ms
:: modules in use:
mysql#mysql-connector-java;6.0.6 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 1 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 1 already retrieved (0kB/7ms)
So my question is: why does the first spark-submit fail like this, and how can it be avoided?
How consistently can you reproduce this? My best theory, after attempting to reproduce with different cluster settings, is that an overloaded server occasionally returns a 5xx error.
As far as workarounds go:
1) Download the jar from Maven Central and pass it with the --jars
option when submitting the job. If you frequently create new clusters, then staging this file on the cluster via initialization actions is the way to go.
2) Provide an alternate Ivy settings file via the spark.jars.ivySettings
property that points at the Google Maven Central mirror (this should reduce or eliminate the odds of 5xx errors).
See this article: https://www.infoq.com/news/2015/11/maven-central-at-google
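Concretely, the two workarounds might look something like the sketch below. The cluster name, bucket, script name, and settings-file path are all placeholders, and the mirror URL is the Google-hosted Maven Central mirror described in the linked article (verify the exact URL and layout for your region):

```shell
# Workaround 1 (sketch): stage the jar in GCS and pass it with --jars.
# "my-bucket", "my-cluster", and "my_job.py" are placeholder names.
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar
gsutil cp mysql-connector-java-6.0.6.jar gs://my-bucket/jars/
gcloud dataproc jobs submit pyspark my_job.py \
  --cluster=my-cluster \
  --jars=gs://my-bucket/jars/mysql-connector-java-6.0.6.jar

# Workaround 2 (sketch): point Ivy at the Google Maven Central mirror
# via a custom settings file referenced by spark.jars.ivySettings.
cat > ivysettings.xml <<'EOF'
<ivysettings>
  <settings defaultResolver="gcs-maven-central"/>
  <resolvers>
    <!-- Mirror URL as announced in the linked article; confirm before use. -->
    <ibiblio name="gcs-maven-central" m2compatible="true"
             root="https://maven-central.storage.googleapis.com/"/>
  </resolvers>
</ivysettings>
EOF
spark-submit \
  --conf spark.jars.ivySettings=/path/to/ivysettings.xml \
  --packages mysql:mysql-connector-java:6.0.6 \
  my_job.py
```

With workaround 1 there is no Ivy resolution at all on the cluster, so the flaky download is avoided entirely; workaround 2 keeps the --packages workflow but swaps the repository endpoint.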