
ClassNotFoundException with Scalding on Zeppelin managed on YARN


I'm trying to get Scalding working on Zeppelin with YARN. I followed the steps in the docs to build the interpreter and set up the classpath override. In local mode, my code executes properly. However, when I run on my cluster via YARN, my jobs fail with:

Error: java.lang.ClassNotFoundException: cascading.CascadingException

or

Error: java.lang.ClassNotFoundException: cascading.tuple.TupleException

What is even stranger to me is that I can go into Zeppelin and execute:

import cascading.tuple.TupleException
import cascading.CascadingException

Both imports succeed, so the interpreter has no problem finding those classes. I only get the ClassNotFoundException when I actually use Scalding on YARN, for example when loading data into a typed pipe and dumping it. Any ideas on how to debug or fix this?


Solution

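  • It looks like the Cascading jars are not being distributed to the YARN cluster. The imports succeed because the interpreter JVM has the jars on its local classpath, but tasks running on the cluster nodes only see jars that are shipped with the job. Please add "zeppelin/interpreter/scalding/*" to the -libjars option in the args.string property of the Scalding interpreter.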

    Here's the args.string we use:

    -libjars /home/zeppelin-user/zeppelin/interpreter/scalding/,/home/zeppelin-user/deploy-bundle-201608111417/libs/ -Dscalding.reducer.estimator.classes=com.twitter.scalding.reducer_estimation.InputSizeReducerEstimator -Delephantbird.use.combine.input.format=true -Delephantbird.combine.split.size=134217728 --hdfs --repl

    The tmpjars configuration property lists the jars that are distributed to the YARN cluster. You can print its contents from a Zeppelin paragraph with the command below:

    %scalding 
    mode.asInstanceOf[Hdfs].conf.get("tmpjars").split(",").foreach(println)
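
If the printed list is long, it can be easier to check it programmatically. The following is a minimal sketch of such a check; `TmpJarsCheck` and its methods are hypothetical helpers (not part of Scalding or Zeppelin), and the jar name fragments you pass in, such as "cascading-core" and "cascading-hadoop", are assumptions that should be adjusted to the jars your interpreter directory actually contains.

```scala
// Hypothetical helper, not part of Scalding or Zeppelin.
// It only inspects a comma-separated string such as the tmpjars value.
object TmpJarsCheck {

  /** Splits a comma-separated tmpjars value into jar paths (empty if null). */
  def jarPaths(tmpjars: String): List[String] =
    Option(tmpjars)
      .map(_.split(",").toList)
      .getOrElse(Nil)
      .map(_.trim)
      .filter(_.nonEmpty)

  /** Returns the required name fragments that match no jar file name. */
  def missing(tmpjars: String, required: Seq[String]): Seq[String] = {
    // Compare against the file name only, ignoring the directory part.
    val fileNames = jarPaths(tmpjars).map(_.split("/").last.toLowerCase)
    required.filterNot(r => fileNames.exists(_.contains(r.toLowerCase)))
  }
}
```

In a %scalding paragraph you could pass mode.asInstanceOf[Hdfs].conf.get("tmpjars") as the first argument. An empty result suggests the named jars are being shipped; any fragment that comes back is a jar that the -libjars setting is probably not picking up.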