
Oozie hourly coordinator timing out on future actions


Data for the previous hour is loaded into HDFS at the 5-minute mark of every hour. I thought I could set up a coordinator job that runs at the 10-minute mark of every hour to process this data, after checking that the directory for that hour exists. What actually happens is that the coordinator runs normally on the past hour's data at submission time and keeps working for the next 2 hours, and then future actions go from 'waiting' to 'timedout'. My guess is that there is a default maximum for how long an action can stay in 'waiting'. It seems counterintuitive that the timeout would hit all actions at the same absolute time, including ones scheduled well into the future.

Here's a sample of the coordinator.xml. I'm looking for suggestions on either how to design this in a way that makes more sense or how to raise the default timeout.

<datasets>
    <dataset name="hourly_cl" frequency="${coord:hours(1)}" initial-instance="2016-02-08T11:10Z" timezone="PST">
        <uri-template>hdfs://user/tzl/warehouse/incoming/logmessages.log.${YEAR}${MONTH}${DAY}/${HOUR}/</uri-template>
        <done-flag></done-flag>
    </dataset>
    <dataset name="hourly_cl_out" frequency="${coord:hours(1)}" initial-instance="2016-02-05T11:10Z" timezone="PST">
        <uri-template>hdfs://user/tzl/warehouse/output/logmessages.log.${YEAR}${MONTH}${DAY}/${HOUR}/</uri-template>
        <done-flag></done-flag>
    </dataset>
</datasets>

<input-events>
    <data-in name="coordInput1" dataset="hourly_cl">
        <instance>${coord:current(-1)}</instance>
    </data-in>
</input-events>
<output-events>
    <data-out name="clout" dataset="hourly_cl_out">
        <instance>${coord:current(-1)}</instance>
    </data-out>
</output-events>

<action>
    <workflow>
        <app-path>${appPath}</app-path>
        <configuration>
            <property>
                <name>inputPath</name>
                <value>${coord:dataIn('coordInput1')}</value>
            </property>
            <property>
                <name>outputPath</name>
                <value>${coord:dataOut('clout')}</value>
            </property>
        </configuration>
    </workflow>
</action>

I also noticed in the logs that Oozie checks EVERY MINUTE for each data directory. In other words, at 18:01 it checks whether these exist:

logmessages.log.20160208/18
logmessages.log.20160208/19
logmessages.log.20160208/20
logmessages.log.20160208/21
...

and at 18:02 it checks the same directories again:

logmessages.log.20160208/18
logmessages.log.20160208/19
logmessages.log.20160208/20
logmessages.log.20160208/21
...

This is probably wasting CPU cycles. I assumed that with the frequency set to one hour, Oozie would be smart enough not to check for future datasets, since I've defined the instance as the past hour's data: current(-1).


Solution

  • I solved this issue with a simple property adjustment: adding a <controls> block under coordinator-app.

    <coordinator-app name="cl_test" frequency="${coord:hours(1)}" start="..." end="..." timezone="PST" xmlns="uri:oozie:coordinator:0.2">
        <controls>
            <timeout>1440</timeout>
            <concurrency>2</concurrency>
            <throttle>1</throttle>
        </controls>
    ...
    ...
    </coordinator-app>
    

    Specifically, <throttle> limits how many actions can be in 'waiting' status at once. With it set to 1, the timeout only applies to the next action that is waiting, so later actions no longer sit in 'waiting' long enough to time out before their data arrives. <timeout> sets how long, in minutes, an action may stay in 'waiting' before it is marked 'timedout' (1440 is 24 hours), and <concurrency> limits how many actions can be running at the same time.
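
    For context, here is a minimal sketch of how the <controls> block might sit alongside the pieces from the question in a single coordinator.xml. As far as I know, the coordinator schema expects <controls> to come before <datasets>. ${startTime} and ${endTime} are hypothetical job properties standing in for the start and end values elided above; everything else is copied from the question, so treat this as an illustration rather than a tested configuration.

    ```xml
    <!-- Sketch only: ${startTime} and ${endTime} are hypothetical job properties. -->
    <coordinator-app name="cl_test" frequency="${coord:hours(1)}"
                     start="${startTime}" end="${endTime}" timezone="PST"
                     xmlns="uri:oozie:coordinator:0.2">
        <controls>
            <!-- Minutes an action may stay in 'waiting' before it is timed out -->
            <timeout>1440</timeout>
            <!-- Maximum number of actions running at the same time -->
            <concurrency>2</concurrency>
            <!-- Maximum number of actions in 'waiting' at the same time -->
            <throttle>1</throttle>
        </controls>

        <datasets>
            <dataset name="hourly_cl" frequency="${coord:hours(1)}" initial-instance="2016-02-08T11:10Z" timezone="PST">
                <uri-template>hdfs://user/tzl/warehouse/incoming/logmessages.log.${YEAR}${MONTH}${DAY}/${HOUR}/</uri-template>
                <!-- Empty done-flag: the directory itself signals that the data is ready -->
                <done-flag></done-flag>
            </dataset>
            <dataset name="hourly_cl_out" frequency="${coord:hours(1)}" initial-instance="2016-02-05T11:10Z" timezone="PST">
                <uri-template>hdfs://user/tzl/warehouse/output/logmessages.log.${YEAR}${MONTH}${DAY}/${HOUR}/</uri-template>
                <done-flag></done-flag>
            </dataset>
        </datasets>

        <input-events>
            <data-in name="coordInput1" dataset="hourly_cl">
                <instance>${coord:current(-1)}</instance>
            </data-in>
        </input-events>
        <output-events>
            <data-out name="clout" dataset="hourly_cl_out">
                <instance>${coord:current(-1)}</instance>
            </data-out>
        </output-events>

        <action>
            <workflow>
                <app-path>${appPath}</app-path>
                <configuration>
                    <property>
                        <name>inputPath</name>
                        <value>${coord:dataIn('coordInput1')}</value>
                    </property>
                    <property>
                        <name>outputPath</name>
                        <value>${coord:dataOut('clout')}</value>
                    </property>
                </configuration>
            </workflow>
        </action>
    </coordinator-app>
    ```

    With <throttle> at 1, only one action at a time should be waiting on its input directory, which should also cut down the per-minute checks for future hours described in the question.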