When I submit a Python script as the <jar> of a Spark action in Oozie, I see the error below:
Traceback (most recent call last):
  File "/home/hadoop/spark.py", line 5, in <module>
    from pyspark import SparkContext, SparkConf
ImportError: No module named pyspark
Intercepting System.exit(1)
However, I can see that the pyspark libraries exist on my local FS:
$ ls /usr/lib/spark/python/pyspark/
accumulators.py heapq3.py rdd.py statcounter.py
broadcast.py __init__.py rddsampler.py status.py
cloudpickle.py java_gateway.py resultiterable.py storagelevel.py
conf.py join.py serializers.py streaming/
context.py ml/ shell.py tests.py
daemon.py mllib/ shuffle.py traceback_utils.py
files.py profiler.py sql/ worker.py
I know there have been issues with running PySpark on Oozie, such as https://issues.apache.org/jira/browse/OOZIE-2482, but the error I am seeing is different from the one in that JIRA ticket.
I am also passing --conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark --conf spark.executorEnv.SPARK_HOME=/usr/lib/spark as spark-opts in my workflow definition.
Here is my sample application for reference:
job.properties:
masterNode=ip-xxx-xx-xx-xx.ec2.internal
nameNode=hdfs://${masterNode}:8020
jobTracker=${masterNode}:8032
master=yarn
mode=client
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozie.wf.application.path=/user/oozie/apps/
workflow.xml:
<workflow-app name="spark-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-action-test"/>
    <action name="spark-action-test">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
            </configuration>
            <master>${master}</master>
            <mode>${mode}</mode>
            <name>Spark Example</name>
            <jar>/home/hadoop/spark.py</jar>
            <spark-opts>--driver-memory 512m --executor-memory 512m --num-executors 4 --conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark --conf spark.executorEnv.SPARK_HOME=/usr/lib/spark --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/lib/spark/python --conf spark.executorEnv.PYTHONPATH=/usr/lib/spark/python --files ${nameNode}/user/oozie/apps/hive-site.xml</spark-opts>
        </spark>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
spark.py:
# create a SparkContext and a HiveContext, then run a simple Hive DDL statement
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf = SparkConf().setAppName('test_pyspark_oozie')
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
As recommended here - http://www.learn4master.com/big-data/pyspark/run-pyspark-on-oozie, I also put the following two files, py4j-0.9-src.zip and pyspark.zip, under my ${nameNode}/user/oozie/share/lib folder.
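For reference, this is roughly how those two archives were copied into the sharelib path (a sketch; the /usr/lib/spark/python/lib/ source location is an assumption based on the standard Spark layout and may differ on EMR):
# copy the PySpark and Py4J archives into the Oozie sharelib path used above
hdfs dfs -put /usr/lib/spark/python/lib/pyspark.zip /user/oozie/share/lib/
hdfs dfs -put /usr/lib/spark/python/lib/py4j-0.9-src.zip /user/oozie/share/lib/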
I am using a single-node YARN cluster (AWS EMR) and am trying to find out how I can pass these pyspark modules to Python in my Oozie application. Any help is appreciated.
You are getting the "No module named" error because you have not set PYTHONPATH in your configuration. Add one more --conf entry with PYTHONPATH=/usr/lib/spark/python. I don't know how to set this PYTHONPATH in the Oozie workflow definition itself, but adding the PYTHONPATH property to your configuration will definitely solve your problem.
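For example, the extra entries in <spark-opts> could look like this (a sketch of what the answer suggests, not a verified fix; the path /usr/lib/spark/python is taken from the question, and the executor-side property is already present in the posted workflow):
--conf spark.yarn.appMasterEnv.PYTHONPATH=/usr/lib/spark/python --conf spark.executorEnv.PYTHONPATH=/usr/lib/spark/python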