
Can Oozie pause a workflow until a certain file is generated/exists?


I'm using Oozie for the first time and finding it a bit hard to parse the specification. I'm trying to create a simple workflow in which I run some queries in Hive, then execute a shell action in order to do some analysis with a different program, and then finally I'd like to execute a Java job through Oozie.

While I understand how to do all of these actions in isolation, how do I set up my workflow so that the final Java job waits for a file to be generated before starting? Googling around, I see ways to make an Oozie workflow wait for a dataset to be generated before it starts, but I don't want the entire workflow to wait; I only want one particular action within the workflow to wait for the input file to be generated.

The input file will be something simple - most likely I'll just have the second action, the shell one, execute some command like touch $(date -u "+%Y-%m-%d-%H").done right before it exits, so that my input file would be a zero-byte file with a name like 2015-07-20-14.done.


Solution

  • Create a coordinator that checks a specified HDFS location for a dataset at a given frequency and launches its workflow only once the data is available.

    Sample coordinator

    <!-- frequency is in minutes: this coordinator materializes one action per day -->
    <coordinator-app name="FILE_CHECK" frequency="1440" start="2009-02-01T00:00Z" end="2009-02-07T00:00Z" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
       <datasets>
          <!-- one dataset instance is expected every 60 minutes -->
          <dataset name="datafile" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
             <uri-template>hdfs://<URI>:<PORT>/data/feed/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
          </dataset>
       </datasets>
       <input-events>
          <!-- wait until the 24 most recent hourly instances are all present -->
          <data-in name="coorddatafile" dataset="datafile">
              <start-instance>${coord:current(-23)}</start-instance>
              <end-instance>${coord:current(0)}</end-instance>
          </data-in>
       </input-events>
       <action>
          <!-- the workflow at this path starts only after the input events are satisfied -->
          <workflow>
             <app-path>hdfs://<URI>:<PORT>/workflows</app-path>
          </workflow>
       </action>
    </coordinator-app>
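
    For the hourly .done marker described in the question, the dataset and input event could be adapted roughly as follows. This is only a sketch, not a tested configuration: the /data/flags directory and the "doneflag" names are hypothetical, and it assumes the coordinator frequency is also changed to 60 so that one action is created per hour. It relies on the documented behavior that an empty done-flag element makes Oozie treat the existence of the URI itself as the readiness signal, instead of looking for the default _SUCCESS file inside it.

    <datasets>
       <!-- hypothetical dataset: one zero-byte marker file per hour -->
       <dataset name="doneflag" frequency="60" initial-instance="2015-07-20T00:00Z" timezone="UTC">
          <!-- /data/flags is an assumed location; the file name follows the question's touch command -->
          <uri-template>hdfs://<URI>:<PORT>/data/flags/${YEAR}-${MONTH}-${DAY}-${HOUR}.done</uri-template>
          <!-- empty done-flag: the path in uri-template is itself the readiness signal -->
          <done-flag></done-flag>
       </dataset>
    </datasets>
    <input-events>
       <!-- wait only for the marker of the current hour -->
       <data-in name="coorddoneflag" dataset="doneflag">
          <instance>${coord:current(0)}</instance>
       </data-in>
    </input-events>

    With this layout, the Hive and shell steps would run in their own workflow and write the marker to HDFS, while the workflow referenced in app-path would contain only the Java action, so only that job waits for the file.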