I am running a Spark job in yarn-client mode via the Oozie Spark action. I need to specify driver and application-master related settings. I tried configuring spark-opts as documented by Oozie, but it's not working.
Here's the example from the Oozie docs:
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
...
<action name="myfirstsparkjob">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>foo:8021</job-tracker>
<name-node>bar:8020</name-node>
<prepare>
<delete path="${jobOutput}"/>
</prepare>
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<master>local[*]</master>
<mode>client</mode>
<name>Spark Example</name>
<class>org.apache.spark.examples.mllib.JavaALS</class>
<jar>/lib/spark-examples_2.10-1.1.0.jar</jar>
<spark-opts>--executor-memory 20G --num-executors 50</spark-opts>
<arg>inputpath=hdfs://localhost/input/file.txt</arg>
<arg>value=2</arg>
</spark>
<ok to="myotherjob"/>
<error to="errorcleanup"/>
</action>
...
</workflow-app>
In the above, spark-opts is specified as --executor-memory 20G --num-executors 50,
while the description on the same page says:
"The spark-opts element if present, contains a list of spark options that can be passed to spark driver. Spark configuration options can be passed by specifying '--conf key=value' here"
So according to the description it should be something like --conf spark.executor.memory=20G.
Which one is right, then? I tried both, but neither seems to work. Since I am running in yarn-client mode, I mainly want to set up driver-related settings, and I think this is the only place I can do that.
<spark-opts>--driver-memory 10g --driver-java-options "-XX:+UseCompressedOops -verbose:gc" --conf spark.driver.memory=10g --conf spark.yarn.am.memory=2g --conf spark.driver.maxResultSize=10g</spark-opts>
<spark-opts>--driver-memory 10g</spark-opts>
None of the above driver-related settings were actually applied to the driver JVM. I verified this by inspecting the Linux process info.
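For reference, this is roughly how I checked which flags the driver JVM actually received. The printf here just simulates the NUL-separated contents of /proc/&lt;pid&gt;/cmdline so the pipeline is self-contained; on a real box you would read the launcher process's cmdline instead:

```shell
# Simulate /proc/<pid>/cmdline (NUL-separated args) for a standalone demo;
# on a live cluster use: tr '\0' '\n' < /proc/<driver-pid>/cmdline
printf 'java\0-Xmx6000m\0-XX:+UseCompressedOops\0org.apache.spark.deploy.SparkSubmit\0' |
  tr '\0' '\n' | grep -E '^-Xmx|^-XX'
# prints:
# -Xmx6000m
# -XX:+UseCompressedOops
```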
reference: https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html
I did find the issue. In yarn-client mode you can't specify driver parameters using <spark-opts>--driver-memory 10g</spark-opts>,
because your driver (the Oozie launcher job) is already launched before that point. The Oozie launcher (which is a MapReduce job) is what launches your actual Spark job, and spark-opts only applies to that launched job. To set driver parameters in yarn-client mode, you instead need to configure the launcher
in the Oozie workflow:
<configuration>
<property>
<name>oozie.launcher.mapreduce.map.memory.mb</name>
<value>8192</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.java.opts</name>
<value>-Xmx6000m</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.cpu.vcores</name>
<value>24</value>
</property>
<property>
<name>mapreduce.job.queuename</name>
<value>default</value>
</property>
</configuration>
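For completeness, these oozie.launcher.* overrides are per-action: they go inside the Spark action's own <configuration> element. A sketch based on my workflow (element names and values are illustrative):

```xml
<action name="myfirstsparkjob">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- Sizes the launcher map task, which IS the Spark driver in yarn-client mode -->
            <property>
                <name>oozie.launcher.mapreduce.map.memory.mb</name>
                <value>8192</value>
            </property>
            <property>
                <name>oozie.launcher.mapreduce.map.java.opts</name>
                <value>-Xmx6000m</value>
            </property>
        </configuration>
        <master>yarn-client</master>
        <!-- name, class, jar, spark-opts, arg elements as usual -->
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

Note that mapreduce.map.java.opts (-Xmx6000m) should stay below mapreduce.map.memory.mb (8192) to leave headroom for off-heap usage.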
I haven't tried yarn-cluster mode, but spark-opts may well work for driver settings there. My question, though, was about yarn-client mode.