This is my first time trying to run a Spark action containing a PySpark script in Oozie. Note that I'm using CDH 5.13 on my local machine (a VM with 12 GB of RAM) and HUE to build the workflow.
The workflow.xml is as follows:
<workflow-app name="sparkMLpy" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-c06a"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="spark-c06a">
        <spark xmlns="uri:oozie:spark-action:0.2">
        </spark>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
I've also tried to add some options:
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.minExecutors=1
Here is the PySpark script (it does pretty much nothing):
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
header = log_txt.first()
log_txt = log_txt.filter(lambda line: line != header)
temp_var = log_txt.map(lambda k: k.split(","))
c_path_out = "/user/cloudera/output/Frth"
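To make clear what the script is meant to do, here is the same transformation in plain Python without Spark. This is only a sketch: the function name `drop_header_and_split` and the sample rows are made up for illustration and are not part of my job.

```python
def drop_header_and_split(lines):
    """Mimic the RDD steps: first(), filter(line != header), map(split(","))."""
    lines = list(lines)
    if not lines:
        return []
    header = lines[0]
    # Like the RDD filter, this drops every line equal to the header,
    # not only the first one.
    return [line.split(",") for line in lines if line != header]


# Hypothetical sample data, not my real input file.
sample = ["id,name", "1,a", "2,b"]
rows = drop_header_and_split(sample)
print(rows)
```

So the job only removes the header line and splits each remaining line on commas.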
Here is a view of the workflow in HUE:
When I run the workflow, it gives no error, but it keeps running with no result (it's not even suspended). Here is a view of the logs:
I've tried adding the options below:
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/local/bin/python2.7
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=/usr/local/bin/python2.7
It is still stuck in the running state. When I checked the logs, I found these warnings:
2019-01-04 02:05:32,398 [Timer-0] WARN org.apache.spark.scheduler.cluster.YarnScheduler - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-01-04 02:05:47,397 [Timer-0] WARN org.apache.spark.scheduler.cluster.YarnScheduler - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
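As far as I understand, this warning means YARN could not hand out containers for the executors. On a single-node VM, the Oozie launcher, the Spark application master and each executor all need their own container. Below is a back-of-the-envelope check of whether such requests fit. It is only a sketch: all numbers are made up (they are not my actual settings), the helper names are hypothetical, and I am assuming that Spark adds an overhead of max(384 MB, 10% of the requested memory) and that YARN rounds each request up to a multiple of its allocation increment.

```python
def round_up(mb, increment_mb):
    """Round a request up to the next multiple of the YARN allocation increment."""
    return -(-mb // increment_mb) * increment_mb


def spark_container_mb(memory_mb, overhead_fraction=0.10, min_overhead_mb=384):
    """Requested memory plus the assumed Spark-on-YARN memory overhead."""
    overhead = max(min_overhead_mb, int(memory_mb * overhead_fraction))
    return memory_mb + overhead


def fits(node_memory_mb, requests_mb, increment_mb):
    """Return (fits, total_mb) for a list of container requests on one node."""
    total = sum(round_up(r, increment_mb) for r in requests_mb)
    return total <= node_memory_mb, total


# Hypothetical numbers: 4 GB usable by YARN, 1 GB allocation increment.
requests = [
    1024,                      # Oozie launcher container
    spark_container_mb(1024),  # Spark application master / driver
    spark_container_mb(1024),  # one executor
]
ok, total = fits(node_memory_mb=4096, requests_mb=requests, increment_mb=1024)
print(ok, total)
```

With these made-up values the launcher and the application master would get their containers, but nothing would be left for an executor, which would match a job that stays in running forever.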
Can you please help?
I also ran the same workflow in local mode (not YARN), and it works!