I'm trying to set up an Oozie map-reduce workflow action to process input files spread across multiple directories. Concretely, say my input is spread across the following directories:
/data/d_20150629-2200
/data/d_20150630-2210
/data/d_20150530-2220
/data/d_20150531-2230
/data/d_20150701-2240
/data/d_20150702-2250
In general there isn't a straightforward glob pattern to capture the list of files that I expect at runtime.
The input specification in my workflow.xml is:
<property>
<name>mapred.input.dir</name>
<value>${inputFile}</value>
</property>
And the parameter value in my workflow.properties is:
inputFile=/user/streaming/data/d_*
With this, my Oozie job is naturally processing all directories under data that begin with d_. Is there a way to modify workflow.xml or workflow.properties to tell Oozie to process files under only the listed six directories?
In Pig, one can specify a comma-separated list of input paths. I also came across these two posts (post1, post2) which touch upon the issue. But in my case, I neither want to apply different mappers to different input paths nor have different input formats. I just want to specify multiple input directories for the same mapper.
Hadoop version: Hadoop 2.3.0-cdh5.1.5 Oozie client build version: 4.0.0-cdh5.1.5
Thanks for any help.
The documentation for mapred.input.dir states that it is "a comma separated list of input directories".
So you could define a property in workflow.properties:
inputFilePath=/user/streaming/data
And use it in workflow.xml:
<property>
<name>mapred.input.dir</name>
<value>${inputFilePath}/d_20150629-2200,${inputFilePath}/d_20150630-2210, ...</value>
</property>
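Alternatively, you can keep workflow.xml unchanged (with `<value>${inputFile}</value>`) and put the full comma-separated list directly in workflow.properties, since Oozie substitutes the property value verbatim. A sketch using the six directories from your question:

```
inputFile=/user/streaming/data/d_20150629-2200,/user/streaming/data/d_20150630-2210,/user/streaming/data/d_20150530-2220,/user/streaming/data/d_20150531-2230,/user/streaming/data/d_20150701-2240,/user/streaming/data/d_20150702-2250
```

This keeps the run-specific path list out of the workflow definition, so you only edit the properties file between runs.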
But if you want to run it only once, then it's easier to copy or move the needed directories to a separate location and pass that as the input.
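If typing out the list by hand is tedious, a small shell sketch can build the comma-separated value for you (and, commented out, do the one-off staging copy). The staging path is hypothetical, and the hadoop fs commands assume a configured Hadoop client:

```shell
# Assumed base path and the six directories from the question
BASE=/user/streaming/data
DIRS="d_20150629-2200 d_20150630-2210 d_20150530-2220 d_20150531-2230 d_20150701-2240 d_20150702-2250"

# Build the comma-separated value for mapred.input.dir
INPUT=$(for d in $DIRS; do printf '%s/%s,' "$BASE" "$d"; done)
INPUT=${INPUT%,}   # drop the trailing comma
echo "inputFile=$INPUT"

# Or, for a one-off run, stage the directories and point the job at the staging dir
# (hypothetical path; requires a Hadoop client on the PATH):
# hadoop fs -mkdir -p /user/streaming/staging
# for d in $DIRS; do hadoop fs -cp "$BASE/$d" /user/streaming/staging/; done
```

You can append the echoed line to workflow.properties before submitting the job.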