oozie-coordinator, oozie-workflow

Oozie initial-instance and start time giving error on missing dataset


I am new to Oozie and trying to understand dataset.xml. I have the following dataset and want to understand what exactly Oozie is validating here. What is the meaning of initial-instance, and what is uri-template doing here? (The Oozie documentation is not clear on this.)

<dataset name="sample" frequency="${coord:hours(1)}" initial-instance="2022-01-10T00:00Z" timezone="UTC">
    <uri-template>${hdfsdir}/filepath/${YEAR}${MONTH}${DAY}${HOUR}</uri-template>
    <done-flag>_SUCCESS</done-flag>
</dataset>

Similarly, in the coordinator I have the following input and output datasets. What is the significance of current(-5) and the start parameter here?

<coordinator-app name="test" frequency="${freq}" start="2022-01-10T00:00Z" end="2023-04-11T00:00Z" timezone="UTC" xmlns="uri:oozie:coordinator:0.4" xmlns:sla="uri:oozie:sla:0.2">

    <data-in name="raw" dataset="raw_data">
        <instance>${coord:current(-5)}</instance>
    </data-in>

    <data-out name="processed" dataset="raw_out">
        <instance>${coord:current(-5)}</instance>
    </data-out>

Can someone explain what Oozie expects of these datasets?

Thanks, bab


Solution

  • Without looking at the documentation, here's what I can guess.

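    • initial-instance - The timestamp of the first instance of the dataset, i.e. when the dataset is first available. If you try to reference a timestamp before this in a workflow or coordinator, I would expect an error.
    • After that, a positive frequency will "count up" from that timestamp, one instance per frequency step (hourly in your dataset).
    • uri-template uses built-in Oozie variables (${YEAR}, ${MONTH}, ${DAY}, ${HOUR}), filled in from each instance's timestamp, to determine the path pattern under which those files exist in the filesystem.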
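
To make the three points above concrete, here is the dataset from the question with comments showing how instances would map to paths. This is an annotated sketch rather than something to deploy as-is: ${hdfsdir} is the job property from the question (its value is not shown there), and the listed paths simply follow from substituting the instance timestamp into the template.

```xml
<dataset name="sample" frequency="${coord:hours(1)}"
         initial-instance="2022-01-10T00:00Z" timezone="UTC">
    <!-- Instances exist at initial-instance + k * frequency, for k = 0, 1, 2, ...
         2022-01-10T00:00Z resolves to ${hdfsdir}/filepath/2022011000
         2022-01-10T01:00Z resolves to ${hdfsdir}/filepath/2022011001
         2022-01-10T02:00Z resolves to ${hdfsdir}/filepath/2022011002 -->
    <uri-template>${hdfsdir}/filepath/${YEAR}${MONTH}${DAY}${HOUR}</uri-template>
    <!-- An instance is treated as available once this file exists
         inside the resolved directory. -->
    <done-flag>_SUCCESS</done-flag>
</dataset>
```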

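    coord:current(-5) will multiply 5 by the dataset frequency and return the 5th previous instance, giving you the dataset instance 5 hours before the nominal time of the coordinator action (not the wall-clock time at which the coordinator was submitted). The start attribute is the time from which the coordinator begins creating actions, so the first action's nominal time is the start time.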
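
A worked example may help here. The comment below applies the formula given in the coordinator spec for coord:current to a few action times. It assumes the coordinator frequency ${freq} is also hourly and that raw_data is defined like the "sample" dataset in the question; neither is shown in the question, so treat the numbers as illustrative.

```xml
<!-- Formula from the coordinator spec (ignoring DST adjustments):
       current(n) = DS_II + DS_FREQ * ((CA_NT - DS_II) div DS_FREQ + n)
     DS_II   = dataset initial-instance           = 2022-01-10T00:00Z
     DS_FREQ = dataset frequency                  = 1 hour
     CA_NT   = nominal time of the coordinator action

     CA_NT = 2022-01-10T00:00Z gives current(-5) = 2022-01-09T19:00Z
             (earlier than initial-instance, so no such instance exists)
     CA_NT = 2022-01-10T05:00Z gives current(-5) = 2022-01-10T00:00Z
             (the first instance of the dataset)
     CA_NT = 2022-01-10T08:00Z gives current(-5) = 2022-01-10T03:00Z -->
<data-in name="raw" dataset="raw_data">
    <instance>${coord:current(-5)}</instance>
</data-in>
```

With start equal to initial-instance, as in the question, the first five hourly actions would all point at instances earlier than initial-instance.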

    So, for your example, you have dataset name="sample" defined, but your data-in and data-out tags reference raw_data and raw_out instead, so I don't think anything will run unless those datasets are defined somewhere else.
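
For reference, a minimal sketch of how the pieces are usually wired together, so that every dataset="..." attribute points at a dataset that is actually defined. The raw_out path, ${workflowAppPath}, and the property names inputDir and outputDir are hypothetical placeholders that do not appear in the question; ${freq} and ${hdfsdir} are the job properties the question already uses.

```xml
<coordinator-app name="test" frequency="${freq}"
                 start="2022-01-10T00:00Z" end="2023-04-11T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <!-- The name must match the dataset attribute used in data-in below. -->
        <dataset name="raw_data" frequency="${coord:hours(1)}"
                 initial-instance="2022-01-10T00:00Z" timezone="UTC">
            <uri-template>${hdfsdir}/filepath/${YEAR}${MONTH}${DAY}${HOUR}</uri-template>
            <done-flag>_SUCCESS</done-flag>
        </dataset>
        <!-- Hypothetical output location; the question does not show raw_out. -->
        <dataset name="raw_out" frequency="${coord:hours(1)}"
                 initial-instance="2022-01-10T00:00Z" timezone="UTC">
            <uri-template>${hdfsdir}/processed/${YEAR}${MONTH}${DAY}${HOUR}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <!-- The action waits until this input instance is available. -->
        <data-in name="raw" dataset="raw_data">
            <instance>${coord:current(-5)}</instance>
        </data-in>
    </input-events>
    <output-events>
        <data-out name="processed" dataset="raw_out">
            <instance>${coord:current(-5)}</instance>
        </data-out>
    </output-events>
    <action>
        <workflow>
            <!-- Hypothetical property holding the workflow's HDFS path. -->
            <app-path>${workflowAppPath}</app-path>
            <configuration>
                <!-- Pass the resolved input and output URIs to the workflow. -->
                <property>
                    <name>inputDir</name>
                    <value>${coord:dataIn('raw')}</value>
                </property>
                <property>
                    <name>outputDir</name>
                    <value>${coord:dataOut('processed')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
```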

    Here are the docs for coord:current (they might say something different from my answer): https://oozie.apache.org/docs/5.2.1/CoordinatorFunctionalSpec.html#a6.6.1._coord:currentint_n_EL_Function_for_Synchronous_Datasets

    Section 5.1 of the same document seems to mostly answer your question.