Tags: hadoop, apache-spark, cloudera, oozie, cloudera-cdh

Running Spark2 from Oozie (CDH)


I am attempting to run a Spark job (using spark2-submit) from Oozie so that it can run on a schedule.

The job runs just fine when we run the shell script from the command line under our service account (not yarn). When we run it as an Oozie workflow, the following happens:

17/11/16 12:03:55 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.hadoop.security.AccessControlException: Permission denied: user=yarn, access=WRITE, inode="/user":hdfs:supergroup:drwxrwxr-x

Oozie is running the job as the user yarn. Our IT department will not let us change yarn's permissions in HDFS, and the Spark script never references the /user directory. We also tried to work around this with ssh, but that doesn't work either: we would have to ssh out of our worker nodes and onto the master.

The shell script:

spark2-submit --name "SparkRunner" --master yarn --deploy-mode client --class org.package-name.Runner  hdfs://manager-node-hdfs/Analytics/Spark_jars/SparkRunner.jar

Any help would be appreciated.


Solution

  • I was able to fix this by following https://stackoverflow.com/a/32834087/8099994

    At the beginning of my shell script I now include the following line:

    export HADOOP_USER_NAME=serviceAccount;
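
For context, here is a rough sketch of what the full wrapper script might look like with that line in place. The likely reason the fix works: Spark on YARN stages its files under the submitting user's HDFS home directory, so when the job runs as yarn it tries to create a directory under /user, which is the WRITE that gets denied. Setting HADOOP_USER_NAME makes the Hadoop client identify itself as that account instead, but as far as I know this only applies on clusters using simple authentication (not Kerberos). In the sketch, `serviceAccount` is a placeholder, and the identity log line and the `command -v` guard are my own additions for illustration, not part of the original script.

```shell
#!/bin/sh
# Assumption: "serviceAccount" is a placeholder for an account that can
# write to its own home directory in HDFS.
# Must be exported before spark2-submit starts so the Hadoop client sees it.
export HADOOP_USER_NAME=serviceAccount

# Illustrative addition: log both identities so the Oozie launcher output
# shows which OS user ran the script and which Hadoop user it claims to be.
echo "OS user: $(id -un), HADOOP_USER_NAME: ${HADOOP_USER_NAME}"

# Illustrative guard: say so clearly if spark2-submit is missing on the
# node that runs the action. A production script would probably also
# exit non-zero here so that Oozie marks the action as failed.
if command -v spark2-submit >/dev/null 2>&1; then
  spark2-submit --name "SparkRunner" --master yarn --deploy-mode client \
    --class org.package-name.Runner \
    hdfs://manager-node-hdfs/Analytics/Spark_jars/SparkRunner.jar
else
  echo "spark2-submit not found on PATH; job not submitted" >&2
fi
```

The export has to come before the spark2-submit call, since the variable is read from the environment of the process that creates the SparkContext (the driver, in client deploy mode).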