Tags: scala, apache-spark, sbt, sbt-assembly, sbt-plugin

SBT run with "provided" dependencies works under the '.' (root) project but fails with no mercy under any subproject


I'm working with the latest sbt.version=1.5.7.

My assembly.sbt is nothing more than addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.1.0").

I have to work with subprojects due to project requirements.

I am facing an issue with Spark dependencies in "provided" scope, similar to this post: How to work efficiently with SBT, Spark and "provided" dependencies?

As the above post describes, I can manage to Compile / run under the root project, but it fails when I Compile / run in the subproject.

Here's my build.sbt detail:

val deps = Seq(
  "org.apache.spark" %% "spark-sql" % "3.1.2" % "provided",
  "org.apache.spark" %% "spark-core" % "3.1.2" % "provided",
  "org.apache.spark" %% "spark-mllib" % "3.1.2" % "provided",
  "org.apache.spark" %% "spark-avro" % "3.1.2" % "provided",
)

val analyticsFrameless =
  (project in file("."))
    .aggregate(sqlChoreography, impressionModelEtl)
    .settings(
      libraryDependencies ++= deps
    )

lazy val sqlChoreography =
  (project in file("sql-choreography"))
    .settings(libraryDependencies ++= deps)

lazy val impressionModelEtl =
  (project in file("impression-model-etl"))
    // .dependsOn(analytics)
    .settings(
      libraryDependencies ++= deps ++ Seq(
        "com.google.guava" % "guava" % "30.1.1-jre",
        "io.delta" %% "delta-core" % "1.0.0",
        "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-2.1.3"
      )
    )

Compile / run := Defaults
  .runTask(
    Compile / fullClasspath,
    Compile / run / mainClass,
    Compile / run / runner
  )
  .evaluated

impressionModelEtl / Compile / run := Defaults
  .runTask(
    impressionModelEtl / Compile / fullClasspath,
    impressionModelEtl / Compile / run / mainClass,
    impressionModelEtl / Compile / run / runner
  )
  .evaluated

After I execute impressionModelEtl / Compile / run with a simple program:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SparkRead {
  def main(args: Array[String]): Unit = {
    val spark =
      SparkSession
        .builder()
        .master("local[*]")
        .appName("SparkReadTestProvidedScope")
        .getOrCreate()
    spark.stop()
  }
}

it fails with the following error:

[error] java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$
[error]         at SparkRead$.main(SparkRead.scala:7)
[error]         at SparkRead.main(SparkRead.scala)
[error]         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error]         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error]         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error]         at java.base/java.lang.reflect.Method.invoke(Method.java:566)
[error] Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$
[error]         at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)

This has baffled me for days. Please help me out... Thanks so much.


Solution

  • Finally, I figured out a solution: split the single build.sbt in the parent project into separate build.sbt files, one per subproject.

    In ./build.sbt:

    import Dependencies._
    ThisBuild / trackInternalDependencies := TrackLevel.TrackIfMissing
    ThisBuild / exportJars                := true
    ThisBuild / scalaVersion              := "2.12.12"
    ThisBuild / version                   := "0.0.1"
    
    ThisBuild / Test / parallelExecution := false
    ThisBuild / Test / fork              := true
    ThisBuild / Test / javaOptions ++= Seq(
      "-Xms512M",
      "-Xmx2048M",
      "-XX:MaxPermSize=2048M",
      "-XX:+CMSClassUnloadingEnabled"
    )
    
    val analyticsFrameless =
      (project in file("."))
        // .dependsOn(sqlChoreography % "compile->compile;test->test", impressionModelEtl % "compile->compile;test->test")
        .settings(
          libraryDependencies ++= deps
        )
    
    lazy val sqlChoreography =
      (project in file("sql-choreography"))
    
    lazy val impressionModelEtl =
      (project in file("impression-model-etl"))
    
    

    Then, in the impression-model-etl directory, create another build.sbt:

    import Dependencies._
    
    lazy val impressionModelEtl =
      (project in file("."))
        .settings(
          libraryDependencies ++= deps ++ Seq(
            "com.google.guava"            % "guava"         % "30.1.1-jre",
            "io.delta"                   %% "delta-core"    % "1.0.0",
            "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-2.1.3"
          )
          // , assembly / assemblyExcludedJars := {
          //   val cp = (assembly / fullClasspath).value
          //   cp filter { _.data.getName == "org.apache.spark" }
          // }
        )
    
    Compile / run := Defaults
      .runTask(
        Compile / fullClasspath,
        Compile / run / mainClass,
        Compile / run / runner
      )
      .evaluated
    
    assembly / assemblyOption := (assembly / assemblyOption).value.withIncludeBin(false)
    
    assembly / assemblyJarName := s"${name.value}_${scalaBinaryVersion.value}-${sparkVersion}_${version.value}.jar"
    
    name := "impression"
    
    

    And be sure to extract the common Spark dependencies into a Dependencies.scala file under the parent's project/ directory (project/Dependencies.scala), so that import Dependencies._ resolves in both build.sbt files:

    import sbt._
    
    object Dependencies {
      // Versions
      lazy val sparkVersion = "3.1.2"
    
      val deps = Seq(
        "org.apache.spark"       %% "spark-sql"                        % sparkVersion             % "provided",
        "org.apache.spark"       %% "spark-core"                       % sparkVersion             % "provided",
        "org.apache.spark"       %% "spark-mllib"                      % sparkVersion             % "provided",
        "org.apache.spark"       %% "spark-avro"                       % sparkVersion             % "provided",
        ...
      )
    }
    
    

    With all these steps done, you can run Spark code locally from the subproject folder while keeping the Spark dependencies in "provided" scope.
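
    For reference, here is a rough usage sketch run from the build root. The jar path and the SparkRead main class are assumptions derived from the settings and the example above, not something the build prints out, so adjust them to your own layout:

    # run the job locally; the Compile / run override puts the "provided" Spark jars on the run classpath
    sbt "impressionModelEtl / Compile / run"

    # build the fat jar; the "provided" Spark dependencies stay out of it
    sbt "impressionModelEtl / assembly"

    # submit to a cluster (or a local Spark install), which supplies Spark itself;
    # with the settings above the assembly jar should end up somewhere like this
    spark-submit --class SparkRead \
      impression-model-etl/target/scala-2.12/impression_2.12-3.1.2_0.0.1.jar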