apache-spark hadoop hbase amazon-emr

The Spark docs recommend listing Spark and Hadoop dependencies as provided. Is this strictly required?


The Spark documentation states:

If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime.

Having all dependencies declared directly and packaged into the deployed uberjar would be much more reliable, especially given how sensitive Hadoop tends to be to class compatibility issues between dependency versions. Even on EMR/AWS, I believe the specialized Spark, Hadoop, and HBase dependencies are available as Maven dependencies. See the hadoop-aws Getting Started docs.

Is it strictly necessary to leave the Spark and Hadoop dependencies as <scope>provided</scope> and absent from the uberjar? Does it cause problems if the Spark and Hadoop dependencies are not <scope>provided</scope>?


Solution

  • It is strictly necessary. Per the guidance, you cannot mix implementations, and with Databricks, Synapse, Cloudera, EMR, etc. it is entirely possible that what you deploy to is not 100% the same as the OSS libraries (this is definitely the case with Databricks, at least).

    This can rear its very ugly head as odd, difficult-to-reproduce initialisation issues or, worse, silently incorrect results.

    Shading and relocating can only get you so far. Indeed, building a shaded Scala library for use in notebooks isn't straightforward with Maven either, as the Maven Shade Plugin does not provide ScalaSig handling. If this is desired, the approach taken by testless and Quality via the scripting plugin works, except for macros.

    The problem can be so thorny that Quality also overrides a number of sensitive library versions in its build profiles; not matching the targeted runtime literally causes failures here (although relocating classes could solve some of these issues).
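To make the two mechanisms discussed above concrete, here is a minimal pom.xml sketch: Spark and Hadoop are declared as provided so they stay out of the uberjar, and one third-party library is relocated with the Maven Shade Plugin. The artifact IDs, version numbers, and the relocated Guava package are assumptions chosen for illustration only; they are not taken from the question or the answer and should be matched to whatever the target cluster actually ships.

```xml
<!-- Illustrative fragment only: versions and artifacts below are placeholders. -->
<dependencies>
  <!-- Compiled against, but not bundled: the cluster supplies these at runtime. -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.5.0</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.3.4</version>
    <scope>provided</scope>
  </dependency>
</dependencies>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.5.1</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <relocations>
              <!-- Hypothetical example: move a bundled library into a private
                   package so it cannot clash with the cluster's own copy. -->
              <relocation>
                <pattern>com.google.common</pattern>
                <shadedPattern>myapp.shaded.com.google.common</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

With this layout, the shaded jar should contain the application code and the relocated library but none of the Spark or Hadoop classes, since provided-scope dependencies are not bundled by the shade plugin. Relocation only helps for libraries you bundle yourself; it does not address the ScalaSig limitation mentioned above.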