
Running Sqoop job on YARN using Oozie


I've got a problem running a Sqoop job on YARN in Oozie using Hue. I want to download a table from an Oracle database and upload it to HDFS. I've got a multinode cluster consisting of 4 nodes.

I want to run a simple Sqoop statement:

import --options-file /tmp/oracle_dos.txt --table BD.BD_TABLE --target-dir /user/user1/files/user_temp_20160930_30 --m 1
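
For context, a Sqoop options file just lists arguments one per line (lines starting with # are comments). A hypothetical oracle_dos.txt could look like this; the connection details below are made-up placeholders, not values from my setup:

--connect
jdbc:oracle:thin:@//dbhost:1521/ORCL
--username
bd_user
--password
secret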

The options file is located on the local file system of node 1; the other nodes have no options file in their /tmp/ directory. I created an Oozie workflow with a Sqoop job and tried to run it, but I got this error:

3432 [main] ERROR org.apache.sqoop.Sqoop  - Error while expanding arguments
java.lang.Exception: Unable to read options file: /tmp/oracle_dos.txt

The weirdest thing is that the job sometimes succeeds and sometimes fails. The log file gave me the answer why: Oozie runs Sqoop jobs on YARN.

The Resource Manager (a component of YARN) decides which node will execute the Sqoop job. When the Resource Manager decides that node 1 (which has the options file on its local file system) should execute the job, everything is fine. But when the RM picks one of the other 3 nodes, the job fails.

This is a big problem for me, because I don't want to upload the options file to every node (what if I have 1000 nodes?). So my question is: is there any way to tell the Resource Manager which node it should use?


Solution

  • You can make a custom file available to your Oozie action on a node by using the <file> tag in your Sqoop action. Look at this syntax:

    <workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
        ...
        <action name="[NODE-NAME]">
            <sqoop xmlns="uri:oozie:sqoop-action:0.2">
                <job-tracker>[JOB-TRACKER]</job-tracker>
                <name-node>[NAME-NODE]</name-node>
                <prepare>
                   <delete path="[PATH]"/>
                   ...
                   <mkdir path="[PATH]"/>
                   ...
                </prepare>
                <configuration>
                    <property>
                        <name>[PROPERTY-NAME]</name>
                        <value>[PROPERTY-VALUE]</value>
                    </property>
                    ...
                </configuration>
                <command>[SQOOP-COMMAND]</command>
                <arg>[SQOOP-ARGUMENT]</arg>
                ...
                <file>[FILE-PATH]</file>
                ...
                <archive>[FILE-PATH]</archive>
                ...
            </sqoop>
            <ok to="[NODE-NAME]"/>
            <error to="[NODE-NAME]"/>
        </action>
        ...
    </workflow-app>
    

    Also read this:

    The file and archive elements make files and archives available to map-reduce jobs. If the specified path is relative, the file or archive is assumed to be within the application directory, in the corresponding sub-path. If the path is absolute, the file or archive is expected at the given absolute path.

    Files specified with the file element will be symbolic links in the home directory of the task.

    ...

    So in the simplest case, you put your oracle_dos.txt file in your workflow directory, add a <file>oracle_dos.txt</file> element to workflow.xml, and change your command to something like this:

    import --options-file ./oracle_dos.txt --table BD.BD_TABLE --target-dir /user/user1/files/user_temp_20160930_30 --m 1
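
    Putting the file in your workflow directory simply means uploading it to HDFS next to workflow.xml. A hypothetical example, assuming the workflow application lives under /user/user1/apps/sqoop-import (adjust the path to your own deployment):

    hdfs dfs -put /tmp/oracle_dos.txt /user/user1/apps/sqoop-import/oracle_dos.txt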
    

    In this case, even though your Sqoop action runs on some arbitrarily picked node in the cluster, Oozie will copy oracle_dos.txt to that node and you can refer to it as a local file.
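
    Putting it all together, a minimal sketch of the action could look like this (the action name, the ${jobTracker}/${nameNode} properties, and the ok/error transitions are placeholders you'd replace with your own values):

    <action name="sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --options-file ./oracle_dos.txt --table BD.BD_TABLE --target-dir /user/user1/files/user_temp_20160930_30 --m 1</command>
            <!-- relative path: oracle_dos.txt sits next to workflow.xml in the app dir;
                 Oozie symlinks it into the task's working directory on whatever node runs the action -->
            <file>oracle_dos.txt</file>
        </sqoop>
        <ok to="end"/>
        <error to="kill"/>
    </action>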