Tags: hadoop, apache-spark, oozie, cloudera-cdh, hue

Run Spark Job via Uber Jar with Oozie and Hue


I'm currently learning how to use Apache Oozie to run Spark jobs in CDH 5.8, but I'm running into problems.

I'm compiling my Spark job with IntelliJ > Build Artifact (into an uber/fat JAR) and then removing its manifest file. Running the JAR with spark-submit works fine.
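
For reference, the spark-submit call that works looks roughly like this (reconstructed from the workflow below, so the exact flags are an assumption):

spark-submit \
    --master 'local[*]' \
    --class ETL.CSVTransform \
    Spark-GetData.jar \
    work_id csv_table id/FirstName/Lastname /user/meshiang/csv/ST1471448595.csv ','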

But when I specify a Spark action with Oozie, I get the following error:

Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], exception invoking main(), java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.SparkMain not found
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.SparkMain not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2199)
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:234)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.SparkMain not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2105)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2197)
    ... 9 more

job.properties:

oozie.use.system.libpath=false
security_enabled=False
dryrun=False
jobTracker=master.meshiang:8032
nameNode=hdfs://master.meshiang:8020

My workflow:

<workflow-app name="CSV" xmlns="uri:oozie:workflow:0.4">
    <start to="spark-2bab"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="spark-2bab">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>local[*]</master>
            <mode>client</mode>
            <name>MySpark</name>
            <class>ETL.CSVTransform</class>
            <jar>/user/meshiang/jar/Spark-GetData.jar</jar>
            <arg>work_id</arg>
            <arg>csv_table</arg>
            <arg>id/FirstName/Lastname</arg>
            <arg>/user/meshiang/csv/ST1471448595.csv</arg>
            <arg>,</arg>
        </spark>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>

What I already did:

  1. When I put the same JAR into the /lib folder of the workspace and used it the same way as above, the job ran for 10 minutes, killed itself, and didn't show any error code or message.
  2. I ran the Spark example job in Hue and got the following message:

Error:

Error code: JA018
Error message: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.RuntimeException: Stream '/jars/oozie-examples.jar' was not found. at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:219) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:106) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(Tr

My questions:

  1. Should I compile only the classes I need and rely on the Oozie ShareLib? Does Oozie support uber JARs in general?
  2. If I'm using Pig/Sqoop, do I need to do the same?

Solution

  • To solve the ClassNotFoundException: Class org.apache.oozie.action.hadoop.SparkMain, you need to enable the Oozie system libpath property:

    oozie.use.system.libpath=true
    

    This is required for running Hive, Pig, Sqoop, Spark, etc. actions, because the launcher classes for those actions (including SparkMain) live in the Oozie ShareLib, not in your application JAR.
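
    A quick way to verify that the Spark sharelib is actually installed is the oozie CLI (the server URL is an assumption; use your own Oozie endpoint):

    export OOZIE_URL=http://master.meshiang:11000/oozie
    oozie admin -shareliblist spark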

    You can compile and build your Spark application JAR and put it into a lib directory under the Oozie application path, i.e. the HDFS directory where you store and reference the workflow.xml file, as sketched below.
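
    For example, assuming the application lives under /user/meshiang/apps/csv in HDFS (the path is hypothetical), the layout can be created like this:

    hdfs dfs -mkdir -p /user/meshiang/apps/csv/lib
    hdfs dfs -put workflow.xml /user/meshiang/apps/csv/
    hdfs dfs -put Spark-GetData.jar /user/meshiang/apps/csv/lib/

    The job.properties from the question would then enable the system libpath and point at that directory:

    oozie.use.system.libpath=true
    security_enabled=False
    dryrun=False
    jobTracker=master.meshiang:8032
    nameNode=hdfs://master.meshiang:8020
    oozie.wf.application.path=${nameNode}/user/meshiang/apps/csv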

    Hope this will help. Thanks.