I use a Cloudera cluster with Apache Spark 2.1.0.cloudera1
installed, but I need a new class from the latest commit of the Apache Spark git repository:
I just copy-pasted the whole file into my sbt Scala project, but I don't know how to write an sbt-assembly MergeStrategy that excludes the cluster-provided class:
org.apache.spark.mllib.linalg.distributed.BlockMatrix
from
org.apache.spark/spark-mllib_2.11/jars/spark-mllib_2.11-2.1.0.cloudera1.jar
and use the newly added project class instead.
My build.sbt file:
val sparkVersion = "2.1.0.cloudera1"
lazy val providedDependencies = Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-mllib" % sparkVersion
)
libraryDependencies ++= providedDependencies.map(_ % "provided")
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", "MANIFEST.MF") => MergeStrategy.discard
  case PathList("org", "apache", "spark", "unused", "UnusedStubClass.class") => MergeStrategy.first
  case _ => MergeStrategy.first
}
If you want to use a Spark that does not correspond to the version used in your environment, just sbt assembly
all the Spark dependencies into a single uber jar and spark-submit
it.
Install sbt-assembly and remove the line where you mark the Spark dependencies as provided
(which tells sbt-assembly to exclude them from the assembly,
which is exactly the contrary of what we aim for).
libraryDependencies ++= providedDependencies.map(_ % "provided")
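A minimal sketch of the changed setup, reusing the providedDependencies val from the question's build.sbt (the sbt-assembly version below is an assumption; pick whatever release matches your sbt):

// project/plugins.sbt — plugin version is an assumption
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

// build.sbt — keep the dependency list as is, but without the "provided" mapping,
// so the Spark jars end up inside the uber jar
libraryDependencies ++= providedDependencies

You may also want to rename the val at that point, since the dependencies are no longer marked provided.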
You have to use the proper version of Spark, i.e. the following line should also be changed (to reflect the version that contains the BlockMatrix.scala
in question).
val sparkVersion = "2.1.0.cloudera1"
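For example, if the copied BlockMatrix comes from the current master branch, the version would be a snapshot of your own Spark build (the exact string below is an assumption; use whatever version your local build publishes):

val sparkVersion = "2.2.0-SNAPSHOT"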
You may want to use your locally-built Spark for this, too. The point is to have all the dependencies in a single uber-jar that overrides what's in your deployment environment.
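A sketch of how to pick up such a locally-built Spark, assuming you installed its artifacts into your local Maven repository (for example with Spark's own Maven build and the install goal):

// build.sbt — also resolve artifacts from the local Maven repository,
// so sbt finds the locally-installed Spark snapshot
resolvers += Resolver.mavenLocal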