google-cloud-dataproc google-cloud-platform spark-submit

GCP Dataproc spark.jars.packages issue downloading dependencies


When creating our Dataproc Spark cluster, we pass --properties spark:spark.jars.packages=mysql:mysql-connector-java:6.0.6 to the gcloud dataproc clusters create command.
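
For reference, the full create command looks roughly like this (the cluster name, region, and any other flags are placeholders for our actual setup):

    gcloud dataproc clusters create my-cluster \
        --region=us-central1 \
        --properties=spark:spark.jars.packages=mysql:mysql-connector-java:6.0.6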

This is so our PySpark scripts can save to Cloud SQL.

Apparently this does nothing at cluster creation time; instead, the first spark-submit on the cluster tries to resolve the dependency.

Technically it does seem to resolve and download the necessary jar file, but the first task on the cluster fails because of the warnings emitted by spark-submit:

Exception in thread "main" java.lang.RuntimeException: [download failed: mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar]
    at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1177)
    at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:298)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The complete output is:

Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
mysql#mysql-connector-java added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found mysql#mysql-connector-java;6.0.6 in central
downloading https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar ...
:: resolution report :: resolve 527ms :: artifacts dl 214ms
    :: modules in use:
    mysql#mysql-connector-java;6.0.6 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   1   |   1   |   1   |   0   ||   1   |   0   |
    ---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
        [FAILED     ] mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar: Downloaded file size doesn't match expected Content Length for https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar. Please retry. (212ms)

        [FAILED     ] mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar: Downloaded file size doesn't match expected Content Length for https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar. Please retry. (212ms)

    ==== central: tried

      https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar

        ::::::::::::::::::::::::::::::::::::::::::::::

        ::              FAILED DOWNLOADS            ::

        :: ^ see resolution messages for details  ^ ::

        ::::::::::::::::::::::::::::::::::::::::::::::

        :: mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar

        ::::::::::::::::::::::::::::::::::::::::::::::

However, subsequent tasks on the cluster show this output:

Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
mysql#mysql-connector-java added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found mysql#mysql-connector-java;6.0.6 in central
:: resolution report :: resolve 224ms :: artifacts dl 5ms
    :: modules in use:
    mysql#mysql-connector-java;6.0.6 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 1 already retrieved (0kB/7ms)

So my questions are:

  1. What is the cause, and can this be fixed by the good people at GCP?
  2. Is there a temporary workaround, besides running a dummy task that is allowed to fail right after the cluster is created?

Solution

  • How consistently can you reproduce this? My best theory, after attempting to reproduce with different cluster settings, is that this is possibly an overloaded server returning a 5xx error.

    As far as workarounds go:

    1) Download the jar from Maven Central and pass it with the --jars option when submitting the job. If you frequently create new clusters, then staging this file on the cluster via initialization actions is the way to go (a sketch follows below).

    2) Provide an alternate Ivy settings file via the spark.jars.ivySettings property that points at the Google Maven Central mirror; this should reduce or eliminate the odds of 5xx errors (see the second sketch below).

    See this article: https://www.infoq.com/news/2015/11/maven-central-at-google
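
    A rough sketch of workaround 1), assuming the jar is staged in a GCS bucket you control (the bucket, cluster, and script names are placeholders):

        # Download the connector once and stage it somewhere the cluster can read.
        wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar
        gsutil cp mysql-connector-java-6.0.6.jar gs://my-bucket/jars/

        # Pass the staged jar explicitly instead of relying on spark.jars.packages.
        gcloud dataproc jobs submit pyspark my_job.py \
            --cluster=my-cluster \
            --jars=gs://my-bucket/jars/mysql-connector-java-6.0.6.jar

    If you create clusters frequently, the same copy step can instead live in an initialization action so the jar is already staged on the cluster before the first job runs.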
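
    And a sketch of workaround 2): an Ivy settings file that resolves artifacts against the Google-hosted Maven Central mirror. The mirror URL and the file path used here are assumptions, so verify them against the linked article before relying on this:

        <ivysettings>
          <!-- Resolve artifacts from the Google-hosted mirror of Maven Central
               instead of repo1.maven.org (mirror URL is an assumption; verify it). -->
          <settings defaultResolver="gcs-maven-central-mirror"/>
          <resolvers>
            <ibiblio name="gcs-maven-central-mirror"
                     m2compatible="true"
                     root="https://maven-central.storage-download.googleapis.com/maven2/"/>
          </resolvers>
        </ivysettings>

    Stage this file at a local path on the cluster nodes (for example via an initialization action) and create the cluster with something like --properties=spark:spark.jars.ivySettings=/etc/spark/conf/ivysettings.xml. As far as I know, spark.jars.ivySettings expects a local file path, which is why the file has to be staged on the cluster rather than referenced from GCS.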