I use a Cloudera cluster with Apache Spark 2.1.0.cloudera1
installed, but I need a new class from the latest commit of the Apache Spark git repository:
I just copy-pasted the whole file into my sbt Scala project, but I don't know how to write an sbt-assembly MergeStrategy that excludes the cluster-provided class:
org.apache.spark.mllib.linalg.distributed.BlockMatrix
from
org.apache.spark/spark-mllib_2.11/jars/spark-mllib_2.11-2.1.0.cloudera1.jar
and use the newly added project class instead.
My build.sbt file:
val sparkVersion = "2.1.0.cloudera1"
lazy val providedDependencies = Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-mllib" % sparkVersion
)
libraryDependencies ++= providedDependencies.map(_ % "provided")
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", "MANIFEST.MF") => MergeStrategy.discard
  case PathList("org", "apache", "spark", "unused", "UnusedStubClass.class") => MergeStrategy.first
  case _ => MergeStrategy.first
}
If you want to use a Spark that does not correspond to the version used in your environment, just sbt assembly
all the Spark dependencies into a single uber jar and spark-submit
it.
Install sbt-assembly and remove the line where you mark the Spark dependencies as provided
(which tells sbt-assembly to exclude them from the assembly,
which is exactly the contrary of what we aim for).
libraryDependencies ++= providedDependencies.map(_ % "provided")
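A minimal sketch of the changed setup, reusing the providedDependencies val from the question's build.sbt (the sbt-assembly version below is an assumption; pick whatever release matches your sbt):

// project/plugins.sbt — plugin version is an assumption
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

// build.sbt — keep the dependency list as is, but without the "provided" mapping,
// so the Spark jars end up inside the uber jar
libraryDependencies ++= providedDependencies

You may also want to rename the val at that point, since the dependencies are no longer marked provided.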
You have to use the proper version of Spark, i.e. the following line should also be changed (to reflect the version that contains the BlockMatrix.scala
in question).
val sparkVersion = "2.1.0.cloudera1"
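For example, if the copied BlockMatrix comes from the current master branch, the version would be a snapshot of your own Spark build (the exact string below is an assumption; use whatever version your local build publishes):

val sparkVersion = "2.2.0-SNAPSHOT"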
You may want to use your locally-built Spark for this, too. The point is to have all the dependencies in a single uber-jar that overrides what's in your deployment environment.
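A sketch of how to pick up such a locally-built Spark, assuming you installed its artifacts into your local Maven repository (for example with Spark's own Maven build and the install goal):

// build.sbt — also resolve artifacts from the local Maven repository,
// so sbt finds the locally-installed Spark snapshot
resolvers += Resolver.mavenLocal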