I have an IPython notebook that contains some PySpark code on a cluster. Currently we are using Oozie to run these notebooks on Hadoop via HUE. The setup feels less than ideal and we were wondering if there is an alternative.
We first convert the .ipynb file into a .py file and move it to HDFS. Along with this file we also create a .sh file that calls the Python file. The contents are similar to:
#!/bin/sh
set -e
[ -r /usr/local/virtualenv/pyspark/bin/activate ] &&
  . /usr/local/virtualenv/pyspark/bin/activate
spark-submit --master yarn-client --<setting> <setting_val> <filename>.py
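For context, the conversion and upload step above could be done with nbconvert and the HDFS CLI; a rough sketch, where paths and names are just examples:

# export the notebook to a plain Python script
jupyter nbconvert --to script <notebook_name>.ipynb
# copy the script and the wrapper shell script to HDFS so Oozie can reach them
hdfs dfs -put -f <notebook_name>.py /user/<user>/workflows/
hdfs dfs -put -f <wrapper_name>.sh /user/<user>/workflows/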
Next we have Oozie point to this .sh file. This flow feels a bit cumbersome, and Oozie doesn't give us much insight into what goes wrong when something fails. We do like how Oozie knows how to run tasks in parallel or serially depending on your configuration.
Is there a better, smoother way of just scheduling PySpark notebooks?
OOZIE-2482 was recently added to Oozie's master branch, which should make running PySpark jobs easier. Unfortunately, it's not in a release yet.
A Spark Action can be added to your workflow; the .py file should be specified in the <jar> element. The .py file, along with your Spark version's pyspark.zip and py4j-<version>-src.zip, has to be placed in the lib/ folder next to the workflow.xml, and then it should work.
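A minimal sketch of staging those files, assuming a Spark install under /usr/local/spark and an application directory /user/<user>/my_app on HDFS (both paths are just placeholders):

# create the application directory and its lib/ folder on HDFS
hdfs dfs -mkdir -p /user/<user>/my_app/lib
hdfs dfs -put -f workflow.xml /user/<user>/my_app/
hdfs dfs -put -f <filename>.py /user/<user>/my_app/lib/
# pyspark.zip and py4j-<version>-src.zip ship with the Spark distribution
hdfs dfs -put -f /usr/local/spark/python/lib/pyspark.zip /user/<user>/my_app/lib/
hdfs dfs -put -f /usr/local/spark/python/lib/py4j-*-src.zip /user/<user>/my_app/lib/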