Tags: apache-spark, oozie

oozie spark action - how to specify spark-opts


I am running a Spark job in yarn-client mode via the Oozie spark action. I need to specify driver and application-master related settings. I tried configuring spark-opts as documented by Oozie, but it's not working.

Here's the example from the Oozie documentation:

Example:

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="myfirstsparkjob">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>foo:8021</job-tracker>
            <name-node>bar:8020</name-node>
            <prepare>
                <delete path="${jobOutput}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
            </configuration>
            <master>local[*]</master>
            <mode>client</mode>
            <name>Spark Example</name>
            <class>org.apache.spark.examples.mllib.JavaALS</class>
            <jar>/lib/spark-examples_2.10-1.1.0.jar</jar>
            <spark-opts>--executor-memory 20G --num-executors 50</spark-opts>
            <arg>inputpath=hdfs://localhost/input/file.txt</arg>
            <arg>value=2</arg>
        </spark>
        <ok to="myotherjob"/>
        <error to="errorcleanup"/>
    </action>
    ...
</workflow-app>

In the above example, spark-opts is specified as --executor-memory 20G --num-executors 50,

while the description on the same page says:

"The spark-opts element if present, contains a list of spark options that can be passed to spark driver. Spark configuration options can be passed by specifying '--conf key=value' here"

So according to the document it should be --conf executor-memory=20G.

Which one is right here, then? I tried both, but neither seems to work. I am running in yarn-client mode, so I mainly want to set up driver-related settings. I think this is the only place I can set them.

<spark-opts>--driver-memory 10g --driver-java-options "-XX:+UseCompressedOops -verbose:gc" --conf spark.driver.memory=10g --conf spark.yarn.am.memory=2g --conf spark.driver.maxResultSize=10g</spark-opts>

<spark-opts>--driver-memory 10g</spark-opts>

None of the above driver-related settings are getting set on the actual driver JVM. I verified this by checking the Linux process info.

Reference: https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html


Solution

  • I found what the issue is. In yarn-client mode you can't specify driver-related parameters using <spark-opts>--driver-memory 10g</spark-opts>, because your driver (the Oozie launcher job) has already been launched by that point. The Oozie launcher (which is a MapReduce job) is what launches your actual Spark job (or any other job), and spark-opts only applies to that launched job. To set driver parameters in yarn-client mode, you instead need to configure the launcher in the Oozie workflow configuration:

    <configuration>
        <property>
            <name>oozie.launcher.mapreduce.map.memory.mb</name>
            <value>8192</value>
        </property>
        <property>
            <name>oozie.launcher.mapreduce.map.java.opts</name>
            <value>-Xmx6000m</value>
        </property>
        <property>
            <name>oozie.launcher.mapreduce.map.cpu.vcores</name>
            <value>24</value>
        </property>
        <property>
            <name>mapreduce.job.queuename</name>
            <value>default</value>
        </property>
    </configuration>
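
    To make it concrete, here is a rough sketch of where such a block would sit in the workflow. This is a minimal example only, reusing the action from the question above; the memory values are assumptions you would tune for your job:

        <action name="myfirstsparkjob">
            <spark xmlns="uri:oozie:spark-action:0.1">
                <job-tracker>foo:8021</job-tracker>
                <name-node>bar:8020</name-node>
                <!-- oozie.launcher.* properties size the launcher map task,
                     i.e. the JVM that hosts the Spark driver in yarn-client mode -->
                <configuration>
                    <property>
                        <name>oozie.launcher.mapreduce.map.memory.mb</name>
                        <value>8192</value>
                    </property>
                    <property>
                        <name>oozie.launcher.mapreduce.map.java.opts</name>
                        <value>-Xmx6000m</value>
                    </property>
                </configuration>
                <master>yarn-client</master>
                <name>Spark Example</name>
                <class>org.apache.spark.examples.mllib.JavaALS</class>
                <jar>/lib/spark-examples_2.10-1.1.0.jar</jar>
                <!-- executor-side settings still go through spark-opts -->
                <spark-opts>--executor-memory 20G --num-executors 50</spark-opts>
            </spark>
            <ok to="myotherjob"/>
            <error to="errorcleanup"/>
        </action>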
    

    I haven't tried yarn-cluster mode, but spark-opts may work for the driver settings there. My question, though, was about yarn-client mode.
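
    For yarn-cluster mode, here is an untested sketch (as noted, I have not tried it) of the same action with driver flags passed through spark-opts; the master value and memory figures are assumptions:

        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>foo:8021</job-tracker>
            <name-node>bar:8020</name-node>
            <master>yarn-cluster</master>
            <name>Spark Example</name>
            <class>org.apache.spark.examples.mllib.JavaALS</class>
            <jar>/lib/spark-examples_2.10-1.1.0.jar</jar>
            <!-- In yarn-cluster mode the driver runs inside the YARN application
                 master rather than the Oozie launcher, so these flags may take
                 effect there (untested, per the note above) -->
            <spark-opts>--driver-memory 10g --conf spark.driver.maxResultSize=10g</spark-opts>
        </spark>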