hadoop cassandra sbt nosuchmethoderror sbt-assembly

Why isn't guava being shaded properly in my build.sbt?


tl;dr: Here's a repo containing the problem.


Cassandra and HDFS both use guava internally, but neither of them shades the dependency for various reasons. Because the guava versions they depend on aren't binary compatible, I'm getting NoSuchMethodErrors at runtime.

I've tried to shade guava myself in my build.sbt:

val HadoopVersion =  "2.6.0-cdh5.11.0"

// ...

val hadoopHdfs = "org.apache.hadoop" % "hadoop-hdfs" % HadoopVersion
val hadoopCommon = "org.apache.hadoop" % "hadoop-common" % HadoopVersion
val hadoopHdfsTest = "org.apache.hadoop" % "hadoop-hdfs" % HadoopVersion % "test" classifier "tests"
val hadoopCommonTest = "org.apache.hadoop" % "hadoop-common" % HadoopVersion % "test" classifier "tests"
val hadoopMiniDFSCluster = "org.apache.hadoop" % "hadoop-minicluster" % HadoopVersion % Test

// ...

assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "shade.com.google.common.@1").inLibrary(hadoopHdfs).inProject,
  ShadeRule.rename("com.google.common.**" -> "shade.com.google.common.@1").inLibrary(hadoopCommon).inProject,
  ShadeRule.rename("com.google.common.**" -> "shade.com.google.common.@1").inLibrary(hadoopHdfsTest).inProject,
  ShadeRule.rename("com.google.common.**" -> "shade.com.google.common.@1").inLibrary(hadoopCommonTest).inProject,
  ShadeRule.rename("com.google.common.**" -> "shade.com.google.common.@1").inLibrary(hadoopMiniDFSCluster).inProject
)

assemblyJarName in assembly := s"${name.value}-${version.value}.jar"

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", "MANIFEST.MF") => MergeStrategy.discard
  case _ => MergeStrategy.first
}

but the runtime exception persists (ha -- it's a cassandra joke, people).

The specific exception is

[info] HdfsEntitySpec *** ABORTED ***
[info]   java.lang.NoSuchMethodError: com.google.common.base.Objects.toStringHelper(Ljava/lang/Object;)Lcom/google/common/base/Objects$ToStringHelper;
[info]   at org.apache.hadoop.metrics2.lib.MetricsRegistry.toString(MetricsRegistry.java:406)
[info]   at java.lang.String.valueOf(String.java:2994)
[info]   at java.lang.StringBuilder.append(StringBuilder.java:131)
[info]   at org.apache.hadoop.ipc.metrics.RetryCacheMetrics.<init>(RetryCacheMetrics.java:46)
[info]   at org.apache.hadoop.ipc.metrics.RetryCacheMetrics.create(RetryCacheMetrics.java:53)
[info]   at org.apache.hadoop.ipc.RetryCache.<init>(RetryCache.java:202)
[info]   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initRetryCache(FSNamesystem.java:1038)
[info]   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:949)
[info]   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:796)
[info]   at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1040)
[info]   ...

How can I properly shade guava to stop the runtime errors?


Solution

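  • Shade rules are only applied when sbt-assembly builds the fat jar (the assembly task). They are not applied during other sbt tasks such as test, which is why the NoSuchMethodError still shows up when HdfsEntitySpec runs.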

    If you want to shade a library inside your hadoop dependencies, you can create a new project that contains only the hadoop dependencies, shade the library there, and publish a fat jar with all the shaded hadoop dependencies.

    This is not a perfect solution: everything bundled into the new hadoop jar will be invisible to dependency resolution in any project that uses it, so you will need to handle conflicts manually.

    Here is the code that you will need in your build.sbt to publish a fat hadoop jar (using your code and sbt assembly docs):

    val HadoopVersion =  "2.6.0-cdh5.11.0"
    
    val hadoopHdfs = "org.apache.hadoop" % "hadoop-hdfs" % HadoopVersion
    val hadoopCommon = "org.apache.hadoop" % "hadoop-common" % HadoopVersion
    val hadoopHdfsTest = "org.apache.hadoop" % "hadoop-hdfs" % HadoopVersion classifier "tests"
    val hadoopCommonTest = "org.apache.hadoop" % "hadoop-common" % HadoopVersion classifier "tests"
    val hadoopMiniDFSCluster = "org.apache.hadoop" % "hadoop-minicluster" % HadoopVersion 
    
    lazy val fatJar = project
      .enablePlugins(AssemblyPlugin)
      .settings(
        libraryDependencies ++= Seq(
            hadoopHdfs,
            hadoopCommon,
            hadoopHdfsTest,
            hadoopCommonTest,
            hadoopMiniDFSCluster
        ),
        assemblyShadeRules in assembly := Seq(
          ShadeRule.rename("com.google.common.**" -> "shade.@0").inAll
        ),
        assemblyMergeStrategy in assembly := {
          case PathList("META-INF", "MANIFEST.MF") => MergeStrategy.discard
          case _ => MergeStrategy.first
        },
        artifact in (Compile, assembly) := {
          val art = (artifact in (Compile, assembly)).value
          art.withClassifier(Some("assembly"))
        },
        addArtifact(artifact in (Compile, assembly), assembly),
        crossPaths := false, // Do not append Scala versions to the generated artifacts
        autoScalaLibrary := false, // This forbids including Scala related libraries into the dependency
        skip in publish := true
      )
    
    lazy val shaded_hadoop = project
      .settings(
        name := "shaded-hadoop",
        packageBin in Compile := (assembly in (fatJar, Compile)).value
      )
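
    The rule above uses shade.@0 where the question used shade.com.google.common.@1. As far as I understand the Jar Jar Links pattern syntax that sbt-assembly relies on, the two forms should produce the same renamed classes; the fragment below is only a side-by-side sketch of that assumption, not something taken from the original build:

    ```scala
    // build.sbt fragment (sketch). Both rules are expected to rename
    // com.google.common.base.Objects to shade.com.google.common.base.Objects.
    assemblyShadeRules in assembly := Seq(
      // @1 is replaced by the text matched by the ** wildcard (here: base.Objects)
      ShadeRule.rename("com.google.common.**" -> "shade.com.google.common.@1").inAll
      // Alternative spelling: @0 is replaced by the whole matched class name
      // ShadeRule.rename("com.google.common.**" -> "shade.@0").inAll
    )
    ```

    Only one of the two is needed. inAll applies the rename to every jar that goes into the assembly, so both guava itself and the hadoop classes that call it are rewritten.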
    

    I haven't tested it, but that is the gist of it.
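
    Once the shaded jar has been published (for example with sbt shaded_hadoop/publishLocal), the project that runs the HDFS tests would depend on it instead of on the original hadoop artifacts. A minimal sketch of that step; the organization and version are placeholders I made up, not values from the original build:

    ```scala
    // build.sbt fragment for the consuming project (sketch).
    // Hypothetical coordinates: replace "com.example" and the version with
    // whatever the shaded_hadoop project is actually published under.
    // %% assumes the published artifact keeps its Scala-version suffix;
    // use % instead if crossPaths := false is also set on shaded_hadoop.
    val shadedHadoop = "com.example" %% "shaded-hadoop" % "0.1.0-SNAPSHOT"

    libraryDependencies ++= Seq(
      // Replaces the direct hadoopHdfs, hadoopCommon, ... entries. If those
      // stay in the list, the unshaded classes are back on the classpath.
      shadedHadoop % Test
    )
    ```

    If another dependency pulls in hadoop transitively, the unshaded classes can still reach the classpath, so it is worth checking the resolved dependencies after making the switch.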


    I'd like to point out another issue that I noticed: your merge strategy might cause you problems, because it applies MergeStrategy.first to everything except the manifest, while some files need a different strategy. See the default strategy here.
    I would recommend using something like this, which keeps the original strategy for every entry except those that would use deduplicate:

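    assemblyMergeStrategy in assembly := {
      // Look up the previously configured (default) strategy once
      val defaultStrategy = (assemblyMergeStrategy in assembly).value
      (entry: String) => {
        val strategy = defaultStrategy(entry)
        // Only replace deduplicate, which fails when duplicate files differ
        if (strategy == MergeStrategy.deduplicate) MergeStrategy.first
        else strategy
      }
    }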