When I submit a Python script as the <jar> of a Spark action in Oozie, I see the error below:
Traceback (most recent call last):
  File "/home/hadoop/spark.py", line 5, in <module>
    from pyspark import SparkContext, SparkConf
ImportError: No module named pyspark
Intercepting System.exit(1)
However, I can see that the pyspark libraries exist on my local FS:
$ ls /usr/lib/spark/python/pyspark/
accumulators.py heapq3.py rdd.py statcounter.py
broadcast.py __init__.py rddsampler.py status.py
cloudpickle.py java_gateway.py resultiterable.py storagelevel.py
conf.py join.py serializers.py streaming/
context.py ml/ shell.py tests.py
daemon.py mllib/ shuffle.py traceback_utils.py
files.py profiler.py sql/ worker.py
I know there have been issues with running PySpark on Oozie, such as https://issues.apache.org/jira/browse/OOZIE-2482, but the error I am seeing is different from the one in that JIRA ticket.
I am also passing --conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark --conf spark.executorEnv.SPARK_HOME=/usr/lib/spark as spark-opts in my workflow definition.
Here is my sample application for reference:
job.properties:
masterNode=ip-xxx-xx-xx-xx.ec2.internal
nameNode=hdfs://${masterNode}:8020
jobTracker=${masterNode}:8032
master=yarn
mode=client
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozie.wf.application.path=/user/oozie/apps/
workflow.xml:
<workflow-app name="spark-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-action-test"/>
    <action name="spark-action-test">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
            </configuration>
            <master>${master}</master>
            <mode>${mode}</mode>
            <name>Spark Example</name>
            <jar>/home/hadoop/spark.py</jar>
            <spark-opts>--driver-memory 512m --executor-memory 512m --num-executors 4 --conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark --conf spark.executorEnv.SPARK_HOME=/usr/lib/spark --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/lib/spark/python --conf spark.executorEnv.PYTHONPATH=/usr/lib/spark/python --files ${nameNode}/user/oozie/apps/hive-site.xml</spark-opts>
        </spark>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
spark.py:
# create a SparkContext and a HiveContext, then run a simple Hive DDL statement
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf = SparkConf().setAppName('test_pyspark_oozie')
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
As recommended here - http://www.learn4master.com/big-data/pyspark/run-pyspark-on-oozie, I also put the following two files, py4j-0.9-src.zip and pyspark.zip, under my ${nameNode}/user/oozie/share/lib folder.
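For reference, this is roughly how those two archives were copied into the sharelib path (a sketch; the /usr/lib/spark/python/lib/ source location is an assumption based on the standard Spark layout and may differ on EMR):
# copy the PySpark and Py4J archives into the Oozie sharelib path used above
hdfs dfs -put /usr/lib/spark/python/lib/pyspark.zip /user/oozie/share/lib/
hdfs dfs -put /usr/lib/spark/python/lib/py4j-0.9-src.zip /user/oozie/share/lib/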
I am using a single-node YARN cluster (AWS EMR) and am trying to find out how I can pass these pyspark modules to Python in my Oozie application. Any help is appreciated.
You are getting the "No module named" error because you have not set PYTHONPATH in your configuration. Add one more --conf entry with PYTHONPATH=/usr/lib/spark/python. I don't know how to set this PYTHONPATH in the Oozie workflow definition itself, but adding the PYTHONPATH property to your configuration will definitely solve your problem.
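For example, the extra entries in <spark-opts> could look like this (a sketch of what the answer suggests, not a verified fix; the path /usr/lib/spark/python is taken from the question, and the executor-side property is already present in the posted workflow):
--conf spark.yarn.appMasterEnv.PYTHONPATH=/usr/lib/spark/python --conf spark.executorEnv.PYTHONPATH=/usr/lib/spark/python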