Tags: scala, apache-spark, stanford-nlp

Scala - Spark-corenlp - java.lang.NoClassDefFoundError


I want to run the spark-corenlp example, but I get a java.lang.NoClassDefFoundError when running spark-submit.

Here is the Scala code, taken from the GitHub example, which I put into an object with a SparkContext and SQLContext defined:

main.scala.Sentiment.scala

package main.scala


import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext

import com.databricks.spark.corenlp.functions._


object SQLContextSingleton {

  @transient private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}


object Sentiment {
  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("Sentiment")
    val sc = new SparkContext(conf)
    val sqlContext = SQLContextSingleton.getInstance(sc)
    import sqlContext.implicits._ 


    val input = Seq((1, "<xml>Stanford University is located in California. It is a great university.</xml>")).toDF("id", "text")

    val output = input
      .select(cleanxml('text).as('doc))
      .select(explode(ssplit('doc)).as('sen))
      .select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

    output.show(truncate = false)
  }
}

And my build.sbt (modified from here):

version := "1.0"

scalaVersion := "2.10.6"

scalaSource in Compile := baseDirectory.value / "src"

initialize := {
  val _ = initialize.value
  val required = VersionNumber("1.8")
  val current = VersionNumber(sys.props("java.specification.version"))
  assert(VersionNumber.Strict.isCompatible(current, required), s"Java $required required.")
}

sparkVersion := "1.5.2"

// change the value below to change the directory where your zip artifact will be created
spDistDirectory := target.value

sparkComponents += "mllib"

// add any sparkPackageDependencies using sparkPackageDependencies.
// e.g. sparkPackageDependencies += "databricks/spark-avro:0.1"
spName := "databricks/spark-corenlp"

licenses := Seq("GPL-3.0" -> url("http://opensource.org/licenses/GPL-3.0"))

resolvers += Resolver.mavenLocal


libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models",
  "com.google.protobuf" % "protobuf-java" % "2.6.1"
)

I run sbt package without issue, then submit the job to Spark with:

spark-submit --class "main.scala.Sentiment" --master local[4] target/scala-2.10/sentimentanalizer_2.10-1.0.jar

The program fails with the following exception:

Exception in thread "main" java.lang.NoClassDefFoundError: edu/stanford/nlp/simple/Sentence
    at main.scala.com.databricks.spark.corenlp.functions$$anonfun$cleanxml$1.apply(functions.scala:55)
    at main.scala.com.databricks.spark.corenlp.functions$$anonfun$cleanxml$1.apply(functions.scala:54)
    at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
    at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)

Things I tried:

I work with Eclipse for Scala, and I made sure to add all the jars from stanford-corenlp as suggested here:

./stanford-corenlp/ejml-0.23.jar
./stanford-corenlp/javax.json-api-1.0-sources.jar
./stanford-corenlp/javax.json.jar
./stanford-corenlp/joda-time-2.9-sources.jar
./stanford-corenlp/joda-time.jar
./stanford-corenlp/jollyday-0.4.7-sources.jar
./stanford-corenlp/jollyday.jar
./stanford-corenlp/protobuf.jar
./stanford-corenlp/slf4j-api.jar
./stanford-corenlp/slf4j-simple.jar
./stanford-corenlp/stanford-corenlp-3.6.0-javadoc.jar
./stanford-corenlp/stanford-corenlp-3.6.0-models.jar
./stanford-corenlp/stanford-corenlp-3.6.0-sources.jar
./stanford-corenlp/stanford-corenlp-3.6.0.jar
./stanford-corenlp/xom-1.2.10-src.jar
./stanford-corenlp/xom.jar

I suspect that I need to add something to my command line when submitting the job to Spark. Any thoughts?


Solution

  • I was on the right track: my command line was missing something.

    spark-submit needs all of the stanford-corenlp jars added:

        spark-submit \
          --jars $(echo stanford-corenlp/*.jar | tr ' ' ',') \
          --class "main.scala.Sentiment" \
          --master local[4] \
          target/scala-2.10/sentimentanalizer_2.10-1.0.jar
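
    An alternative, if you would rather not list the jars on every submit, is to build a single assembly (fat) jar that bundles the CoreNLP dependencies. The sketch below assumes the sbt-assembly plugin (the version shown is illustrative) and assumes the sbt-spark-package plugin already marks the Spark dependencies as provided; the merge strategy may need adjusting for your project.

    project/assembly.sbt:

        addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

    Appended to build.sbt:

        // Resolve duplicate files pulled in by CoreNLP and its transitive
        // dependencies: discard META-INF entries, keep the first copy of the rest.
        assemblyMergeStrategy in assembly := {
          case PathList("META-INF", xs @ _*) => MergeStrategy.discard
          case _                             => MergeStrategy.first
        }

    Running sbt assembly then produces something like target/scala-2.10/sentimentanalizer-assembly-1.0.jar (the exact name depends on your project settings), which can be submitted without --jars. Keep in mind that the CoreNLP models jar is large, so the assembly will be several hundred megabytes; the --jars approach above is the lighter option.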