Tags: scala, jar, apache-spark, classpath, apache-tika

Classpath issues running Tika on Spark


I am trying to process a bunch of files with Tika. The number of files is in the thousands, so I decided to build an RDD of the files and let Spark distribute the workload. Unfortunately, I get multiple NoClassDefFoundError exceptions.

This is my sbt file:

name := "TikaFileParser"
version := "0.1"
scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"
libraryDependencies += "org.apache.tika" % "tika-core" % "1.11"
libraryDependencies += "org.apache.tika" % "tika-parsers" % "1.11"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.1" % "provided"

This is my assembly.sbt (in the project/ directory):

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

And this is the source file:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.input.PortableDataStream
import org.apache.tika.metadata._
import org.apache.tika.parser._
import org.apache.tika.sax.WriteOutContentHandler
import java.io._

object TikaFileParser {

  def tikaFunc (a: (String, PortableDataStream)) = {

    val file : File = new File(a._1.drop(5)) // drop the leading "file:" from the URI returned by binaryFiles
    val myparser : AutoDetectParser = new AutoDetectParser()
    val stream : InputStream = new FileInputStream(file)
    val handler : WriteOutContentHandler = new WriteOutContentHandler(-1) // -1 = no write limit
    val metadata : Metadata = new Metadata()
    val context : ParseContext = new ParseContext()

    myparser.parse(stream, handler, metadata, context)

    stream.close()

    println(handler.toString())
    println("------------------------------------------------")
  }


  def main(args: Array[String]) {

    val filesPath = "/home/user/documents/*"
    val conf = new SparkConf().setAppName("TikaFileParser")
    val sc = new SparkContext(conf)
    val fileData = sc.binaryFiles(filesPath)
    fileData.foreach(x => tikaFunc(x))
  }
}
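
Note that binaryFiles already ships each file's contents to the executors as a PortableDataStream, so the detour through a local File path is only safe while everything runs on one machine. A variant that reads the shipped bytes directly (a sketch, using the same Tika calls as above) would also work when the input is not on the driver's local filesystem:

def tikaFuncStream(a: (String, PortableDataStream)) = {
  val myparser = new AutoDetectParser()
  val stream = a._2.open() // read the bytes Spark shipped with the record
  val handler = new WriteOutContentHandler(-1)
  val metadata = new Metadata()
  val context = new ParseContext()
  try myparser.parse(stream, handler, metadata, context)
  finally stream.close()
  println(handler.toString())
}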

I am running this with:

spark-submit --driver-memory 2g --class TikaFileParser --master local[4]
             /path/to/TikaFileParser-assembly-0.1.jar

And I get java.lang.NoClassDefFoundError: org/apache/cxf/jaxrs/ext/multipart/ContentDisposition, which is a dependency of one of the parsers. Out of curiosity I added the jar containing this class to Spark's --jars option and ran it again. This time I got a new NoClassDefFoundError (I can't remember which one, but it was also a Tika dependency).

I already found a similar problem here (Apache Tika 1.11 on Spark NoClassDeftFoundError) where the solution was to build a fat jar. But I would like to know whether there is any other way to solve the dependency issues?

Btw: I tried this snippet without Spark (just an Array with the file names and a foreach loop, with the tikaFunc signature changed accordingly). I ran it without any arguments and it worked perfectly.
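
The Spark-free sanity check looked roughly like this (file names are illustrative, with tikaFunc changed to take a plain path String):

val files = Array("/home/user/documents/a.pdf", "/home/user/documents/b.docx")
files.foreach(name => tikaFunc(name)) // tikaFunc(path: String) variant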

Edit: Updated the snippets for use with sbt-assembly.


Solution

  • The issues came from version mismatches among the jars. I decided on the following sbt file, which solved my problem:

    name := "TikaFileParser"
    version := "0.1"
    scalaVersion := "2.11.7"
    
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"
    libraryDependencies += "org.apache.tika" % "tika-core" % "1.11"
    libraryDependencies += "org.apache.tika" % "tika-parsers" % "1.11"
    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.1" % "provided"
    
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case _ => MergeStrategy.first
    }
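
  • One caveat with the blanket META-INF discard: Tika discovers its parsers through service registry files under META-INF/services, so if AutoDetectParser suddenly extracts nothing from the assembled jar, a variant that keeps (and de-duplicates) those entries is worth trying, e.g. with sbt-assembly's filterDistinctLines strategy (a sketch):

    assemblyMergeStrategy in assembly := {
      // keep service registrations such as META-INF/services/org.apache.tika.parser.Parser
      case PathList("META-INF", "services", xs @ _*) => MergeStrategy.filterDistinctLines
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case _ => MergeStrategy.first
    }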