
How do I add files to the distributed cache in an Oozie job?


I am implementing an Oozie workflow where the first job reads data from a database using Sqoop and writes it to HDFS. The second job needs to read a large amount of data and use the files written by job one to process it. Here is what I have thought of or tried:

  1. Assuming job one writes its files to some directory on HDFS, adding them to the distributed cache in the driver class of job two will not work, because the Oozie workflow only knows about the mapper and reducer classes of the job. (Please correct me if I am wrong here.)

  2. I also tried writing to the lib directory of the workflow, hoping the files would then be added to the distributed cache automatically, but I understand the lib directory is read-only while the job is running.

  3. I also thought that if I could add the files to the distributed cache in the setup() of job two, I could access them in the mapper/reducer. I am not aware of how to add files in setup(); is that possible?

How else can I read the output files of the previous job from the distributed cache in the subsequent job? I am already using the input directory of job two to read the data that needs to be processed, so I cannot use that.

I am using Hadoop 1.2.1 and Oozie 3.3.2 on an Ubuntu 12.04 virtual machine.
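
For reference, the workflow is shaped roughly like the sketch below; the action names, paths, and connection settings are simplified placeholders:

    <workflow-app xmlns="uri:oozie:workflow:0.2" name="import-and-process">
        <start to="sqoop-import"/>
        <action name="sqoop-import">
            <sqoop xmlns="uri:oozie:sqoop-action:0.2">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <command>import --connect ${jdbcUri} --table ${lookupTable} --target-dir ${lookupDir}</command>
            </sqoop>
            <ok to="process-data"/>
            <error to="fail"/>
        </action>
        <action name="process-data">
            <map-reduce>
                <!-- job two: reads the large data set and needs the files from job one -->
                ...
            </map-reduce>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Workflow failed</message>
        </kill>
        <end name="end"/>
    </workflow-app>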


Solution

  • Add the properties below to your map-reduce action to place files or archives in the distributed cache. Refer to the Oozie documentation for details.

        <file>[FILE-PATH]</file>
        ...
        <archive>[FILE-PATH]</archive>
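
    For example, a map-reduce action that puts job one's output into the distributed cache could be sketched as follows; the paths and ${...} properties are placeholders, and the #lookup fragment makes the file available under the symlink lookup in each task's working directory:

        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <!-- mapper/reducer classes, input and output dirs, etc. -->
                ...
            </configuration>
            <!-- output of job one (the Sqoop import), added to the
                 distributed cache and symlinked as "lookup" -->
            <file>${lookupDir}/part-m-00000#lookup</file>
        </map-reduce>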
    

    You can also pass input to a java action's main class on the command line, as shown below.

        <main-class>org.apache.oozie.MyFirstMainClass</main-class>
        <java-opts>-Dblah</java-opts>
        <arg>argument1</arg>
        <arg>argument2</arg>
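
    To address point 3 of the question: once a <file> element has localized the file, you do not add it to the cache in setup(); you simply read it there. A minimal sketch of a mapper, assuming the #lookup symlink from the action above (class name and key/value types are placeholders):

        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.io.IOException;

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

            @Override
            protected void setup(Context context)
                    throws IOException, InterruptedException {
                // "lookup" is the symlink created by the #lookup fragment of the
                // <file> element; it points at the localized copy of the file.
                BufferedReader reader = new BufferedReader(new FileReader("lookup"));
                try {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // load job one's output into an in-memory structure here
                    }
                } finally {
                    reader.close();
                }
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // process the large input using the data loaded in setup()
            }
        }

    Note that with Hadoop 1.x, running a new-API mapper like this through an Oozie map-reduce action also requires setting mapred.mapper.new-api to true in the action's configuration.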