I'm trying to process a bunch of files with Tika. The number of files is in the thousands, so I decided to build an RDD of the files and let Spark distribute the workload. Unfortunately I get multiple NoClassDefFoundError exceptions.
This is my build.sbt:
name := "TikaFileParser"
version := "0.1"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"
libraryDependencies += "org.apache.tika" % "tika-core" % "1.11"
libraryDependencies += "org.apache.tika" % "tika-parsers" % "1.11"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.1" % "provided"
This is my assembly.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")
And this is the source file:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.input.PortableDataStream
import org.apache.tika.metadata._
import org.apache.tika.parser._
import org.apache.tika.sax.WriteOutContentHandler
import java.io._
object TikaFileParser {

  def tikaFunc(a: (String, PortableDataStream)): Unit = {
    // binaryFiles yields "file:/path/..." URIs; drop the "file:" scheme prefix
    val file: File = new File(a._1.drop(5))
    val myParser: AutoDetectParser = new AutoDetectParser()
    val stream: InputStream = new FileInputStream(file)
    // -1 disables the write limit so large documents aren't truncated
    val handler: WriteOutContentHandler = new WriteOutContentHandler(-1)
    val metadata: Metadata = new Metadata()
    val context: ParseContext = new ParseContext()

    myParser.parse(stream, handler, metadata, context)
    stream.close()

    println(handler.toString)
    println("------------------------------------------------")
  }

  def main(args: Array[String]): Unit = {
    val filesPath = "/home/user/documents/*"
    val conf = new SparkConf().setAppName("TikaFileParser")
    val sc = new SparkContext(conf)
    // (filename, stream) pairs for every file matching the glob
    val fileData = sc.binaryFiles(filesPath)
    fileData.foreach(x => tikaFunc(x))
  }
}
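As an aside, the PortableDataStream from binaryFiles already exposes the file contents directly, so reconstructing a local File from the URI isn't strictly necessary. A minimal sketch of that variant (same imports as above; I haven't benchmarked it against the File-based version):

def tikaFuncStream(a: (String, PortableDataStream)): Unit = {
  // open() gives an InputStream straight from Spark; also works for non-local paths
  val stream = a._2.open()
  val handler = new WriteOutContentHandler(-1)
  val metadata = new Metadata()
  new AutoDetectParser().parse(stream, handler, metadata, new ParseContext())
  stream.close()
  println(handler.toString)
}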
I am running this with
spark-submit --driver-memory 2g --class TikaFileParser --master local[4] \
  /path/to/TikaFileParser-assembly-0.1.jar
and get java.lang.NoClassDefFoundError: org/apache/cxf/jaxrs/ext/multipart/ContentDisposition, which is a dependency of one of the parsers. Out of curiosity I added the jar containing this class to Spark's --jars option and ran it again. This time I got a new NoClassDefFoundError (I can't remember which class, but it was also a Tika dependency).
I already found a similar problem here (Apache Tika 1.11 on Spark NoClassDefFoundError) where the solution was to build a fat jar. But I would like to know whether there is any other way to solve the dependency issues.
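One alternative that might work (I haven't verified it for this exact setup): spark-submit can resolve an artifact and its transitive dependencies from Maven Central at submit time via --packages, which would spare you the fat jar. Something like the following, using the plain jar from sbt package instead of the assembly (exact jar name depends on your build):

spark-submit --driver-memory 2g --class TikaFileParser --master local[4] \
  --packages org.apache.tika:tika-parsers:1.11 \
  /path/to/TikaFileParser-0.1.jar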
Btw: I tried this snippet without Spark (I just used an Array with the file names and a foreach loop, and changed the tikaFunc signature accordingly). I ran it without any arguments and it worked perfectly.
Edit: I updated the snippets above for use with sbt assembly.
The issues came from version mismatches in the jars. I settled on the following sbt file, which solves my problem:
name := "TikaFileParser"
version := "0.1"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"
libraryDependencies += "org.apache.tika" % "tika-core" % "1.11"
libraryDependencies += "org.apache.tika" % "tika-parsers" % "1.11"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.1" % "provided"
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _ => MergeStrategy.first
}
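Discarding META-INF avoids clashes between the signature and manifest files that many of the Tika dependency jars ship, and MergeStrategy.first simply keeps the first copy of any other duplicate entry. To confirm that a previously missing class actually made it into the assembly, listing the jar contents works (the path is just my layout, adjust as needed):

jar tf target/scala-2.11/TikaFileParser-assembly-0.1.jar | grep ContentDisposition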