
Oozie processing input in multiple directories with one mapper


I'm trying to set up an Oozie map-reduce workflow action to process input files spread across multiple directories. Concretely, say my input is spread across the following directories:

/data/d_20150629-2200
/data/d_20150630-2210
/data/d_20150530-2220
/data/d_20150531-2230
/data/d_20150701-2240
/data/d_20150702-2250

In general there isn't a straightforward glob pattern to capture the list of files that I expect at runtime.

The input specification in my workflow.xml is:

<property>
    <name>mapred.input.dir</name>
    <value>${inputFile}</value>
</property>

And the parameter value in my workflow.properties is:

inputFile=/user/streaming/data/d_*

With this, my Oozie job naturally processes every directory under data whose name begins with d_. Is there a way to modify workflow.xml or workflow.properties so that Oozie processes only the files under the six listed directories?

In Pig, one can specify a comma-separated list of input paths. I also came across these two posts (post1, post2) which touch upon the issue. But in my case, I neither want to apply different mappers to different input paths nor have different input formats. I just want to specify multiple input directories for the same mapper.

Hadoop version: Hadoop 2.3.0-cdh5.1.5
Oozie client build version: 4.0.0-cdh5.1.5

Thanks for any help.


Solution

  • mapred.input.dir documentation states that it's

    a comma separated list of input directories

    So you could define a property in workflow.properties:

    inputFilePath=/user/streaming/data
    

    And use it in workflow.xml:

    <property>
        <name>mapred.input.dir</name>
        <value>${inputFilePath}/d_20150629-2200,${inputFilePath}/d_20150630-2210, ...</value>
    </property>
    

    But if you only need to run it once, it's easier to copy or move the needed directories into a separate directory and pass that directory as the input.
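
    Since the value is just a comma-separated string, another option is to put the whole list in workflow.properties and leave the original ${inputFile} reference in workflow.xml untouched. The fragment below is only a sketch: it assumes the directories from the question live under /user/streaming/data, and it lists just two of them, with the rest to be appended in the same way. Leaving no spaces around the commas is probably the safest choice, since a stray space may end up being treated as part of a path.

    ```properties
    # workflow.properties (illustrative sketch, list shortened to two entries)
    # Assumption: directories sit under /user/streaming/data, as in the glob from the question.
    inputFile=/user/streaming/data/d_20150629-2200,/user/streaming/data/d_20150630-2210
    ```

    With this, the list can be changed per run by editing only the properties file, without touching the workflow definition.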