Attempting to create a fat jar with sbt gives an error like this:
java.lang.RuntimeException: deduplicate: different file contents found in the following:
C:\Users\db\.ivy2\cache\org.apache.spark\spark-network-common_2.10\jars\spark-network-common_2.10-1.6.3.jar:com/google/common/base/Function.class
C:\Users\db\.ivy2\cache\com.google.guava\guava\bundles\guava-14.0.1.jar:com/google/common/base/Function.class
There are many such classes; this is just one example. Guava 14.0.1 is the version in play for Function.class in both jars:
[info] +-com.google.guava:guava:14.0.1
...
[info] | | +-com.google.guava:guava:14.0.1
Since both paths resolve to the same version, sbt/ivy won't evict one as older, yet the sizes and dates of the class files in the two jars differ, which presumably is what triggers the error above:
$ jar tvf /c/Users/db/.ivy2/cache/org.apache.spark/spark-network-common_2.10/jars/spark-network-common_2.10-1.6.3.jar | grep "com/google/common/base/Function.class"
549 Wed Nov 02 16:03:20 CDT 2016 com/google/common/base/Function.class
$ jar tvf /c/Users/db/.ivy2/cache/com.google.guava/guava/bundles/guava-14.0.1.jar | grep "com/google/common/base/Function.class"
543 Thu Mar 14 19:56:52 CDT 2013 com/google/common/base/Function.class
It appears that Apache is re-compiling Function.class from source rather than including the class as originally compiled. Is that a correct understanding of what's happening here? It's possible to exclude the re-compiled classes using sbt, but is there a way to build the jar without explicitly excluding, by name, each jar that contains re-compiled source? Excluding the jars explicitly leads to something along the lines of the snippet below, which makes it seem that I'm going down the wrong path:
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.3"
excludeAll(
ExclusionRule(organization = "com.twitter"),
ExclusionRule(organization = "org.apache.spark", name = "spark-network-common_2.10"),
ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-client"),
ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-hdfs"),
ExclusionRule(organization = "org.tachyonproject", name = "tachyon-client"),
ExclusionRule(organization = "commons-beanutils", name = "commons-beanutils"),
ExclusionRule(organization = "commons-collections", name = "commons-collections"),
ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-yarn-api"),
ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-yarn-common"),
ExclusionRule(organization = "org.apache.curator", name = "curator-recipes")
)
,
libraryDependencies += "org.apache.spark" %% "spark-network-common" % "1.6.3" exclude("com.google.guava", "guava"),
libraryDependencies += "org.apache.spark" %% "spark-graphx" % "1.6.3",
libraryDependencies += "com.typesafe.scala-logging" %% "scala-logging-slf4j" % "2.1.2",
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0" exclude("com.google.guava", "guava"),
libraryDependencies += "com.google.guava" % "guava" % "14.0.1",
libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11",
libraryDependencies += "org.json4s" %% "json4s-ext" % "3.2.11",
libraryDependencies += "com.rabbitmq" % "amqp-client" % "4.1.1",
libraryDependencies += "commons-codec" % "commons-codec" % "1.10",
If that's the wrong path, what's a cleaner way?
The cleaner way is not to package spark-core at all: it is installed with Spark on your target machines and is available to your application at runtime (you can usually find the jars under /usr/lib/spark/jars). Mark these Spark dependencies as % "provided". That keeps them, along with their transitive dependencies such as Guava, out of the fat jar and avoids most of the conflicts caused by packaging them.
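As a minimal sketch of that approach, assuming the fat jar is built with the sbt-assembly plugin and keeping the versions from the question (the grouping into a single Seq is just for brevity):

// Spark, and everything it drags in transitively (including Guava), is supplied
// by the cluster at runtime, so compile against it but keep it out of the fat jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "1.6.3" % "provided",
  "org.apache.spark" %% "spark-graphx" % "1.6.3" % "provided",
  // Assuming Hadoop is also installed on the cluster.
  "org.apache.hadoop" % "hadoop-client" % "2.2.0" % "provided",
  // Application-only dependencies are still packaged into the assembly.
  "com.typesafe.scala-logging" %% "scala-logging-slf4j" % "2.1.2",
  "org.json4s" %% "json4s-native" % "3.2.11",
  "org.json4s" %% "json4s-ext" % "3.2.11",
  "com.rabbitmq" % "amqp-client" % "4.1.1",
  "commons-codec" % "commons-codec" % "1.10"
)

With the Spark artifacts marked provided, the explicit ExclusionRule list and the direct spark-network-common and Guava dependencies added to work around the conflict should no longer be needed. One caveat: provided dependencies are also absent from the classpath used by sbt run, so local runs need Spark supplied some other way, for example by launching the assembled jar through spark-submit.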