hadoop, oozie-coordinator

Dealing with irregularly timed data with Oozie coordinator


I have multiple sources of data that need to be considered in an Oozie coordinator workflow. The datasets are generated irregularly, meaning that data may not be generated on some days. For instance:

data_set1:
  ds1-1 - Sept-1-2015 - Data available
  ds1-2 - Sept-2-2015 - No Data
  ds1-3 - Sept-3-2015 - No Data
  ds1-4 - Sept-4-2015 - Data available
  ds1-5 - Sept-5-2015 - Data available
  ds1-6 - Sept-6-2015 - No Data
  ds1-7 - Sept-7-2015 - Data available

data_set2:
  ds2-1 - Sept-1-2015 - Data available
  ds2-2 - Sept-2-2015 - Data available
  ds2-3 - Sept-3-2015 - Data available
  ds2-4 - Sept-4-2015 - No Data
  ds2-5 - Sept-5-2015 - Data available
  ds2-6 - Sept-6-2015 - Data available
  ds2-7 - Sept-7-2015 - No Data

My Oozie coordinator job is scheduled to run daily. However, since a dataset may not be available for a given day, each run must pick up the latest available instance of each dataset. For the datasets above, I expect the following instances to be used for each run:

  Sept-1-2015 - ds1-1, ds2-1
  Sept-2-2015 - ds1-1, ds2-2   #since no ds1 available for day2.
  Sept-3-2015 - ds1-1, ds2-3   #since no ds1 available for day3.
  Sept-4-2015 - ds1-4, ds2-3   #since no ds2 available for day4.
  Sept-5-2015 - ds1-5, ds2-5
  Sept-6-2015 - ds1-5, ds2-6   #since no ds1 available for day6
  Sept-7-2015 - ds1-7, ds2-6   #since no ds2 available for day7.

Is there any way to achieve this with the available Oozie constructs?


Solution

  • If you want the latest available data, use the coord:latest EL function, which resolves to the most recent dataset instance that actually exists. It takes an index, coord:latest(n), where 0 is the most recent instance and -1 is the one before it. The Oozie documentation describes it as follows:

    ${coord:latest(int n)} represents the nth latest currently available instance of a synchronous dataset.

    In your case, declare a data-in like the one below for each dataset:

        <data-in name="input" dataset="logs">
          <instance>${coord:latest(0)}</instance>
        </data-in>
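
    To show how this fits the two datasets from the question, here is a minimal sketch of a full coordinator, not a tested configuration. The app name, the HDFS paths, the _SUCCESS done flag, the workflowAppPath parameter, and the ds1Path/ds2Path property names are all assumptions made for illustration; only coord:latest and the data-in structure come from the answer above.

    ```xml
    <!-- Sketch only: names, paths and dates below are illustrative assumptions. -->
    <coordinator-app name="latest-available-coord" frequency="${coord:days(1)}"
                     start="2015-09-01T00:00Z" end="2015-09-08T00:00Z" timezone="UTC"
                     xmlns="uri:oozie:coordinator:0.4">
      <datasets>
        <!-- One dataset per source; an instance counts as available
             once its done flag exists in the resolved directory. -->
        <dataset name="ds1" frequency="${coord:days(1)}"
                 initial-instance="2015-09-01T00:00Z" timezone="UTC">
          <uri-template>hdfs://namenode:8020/data/ds1/${YEAR}/${MONTH}/${DAY}</uri-template>
          <done-flag>_SUCCESS</done-flag>
        </dataset>
        <dataset name="ds2" frequency="${coord:days(1)}"
                 initial-instance="2015-09-01T00:00Z" timezone="UTC">
          <uri-template>hdfs://namenode:8020/data/ds2/${YEAR}/${MONTH}/${DAY}</uri-template>
          <done-flag>_SUCCESS</done-flag>
        </dataset>
      </datasets>
      <input-events>
        <!-- latest(0) is the most recent existing instance of each dataset;
             latest(-1) would be the one before it. -->
        <data-in name="ds1_input" dataset="ds1">
          <instance>${coord:latest(0)}</instance>
        </data-in>
        <data-in name="ds2_input" dataset="ds2">
          <instance>${coord:latest(0)}</instance>
        </data-in>
      </input-events>
      <action>
        <workflow>
          <app-path>${workflowAppPath}</app-path>
          <configuration>
            <!-- Pass the resolved input directories to the workflow. -->
            <property>
              <name>ds1Path</name>
              <value>${coord:dataIn('ds1_input')}</value>
            </property>
            <property>
              <name>ds2Path</name>
              <value>${coord:dataIn('ds2_input')}</value>
            </property>
          </configuration>
        </workflow>
      </action>
    </coordinator-app>
    ```

    With this layout, the Sept-2 run should resolve ds1_input to the Sept-1 instance and ds2_input to the Sept-2 instance, matching the expected table in the question. One caveat, as far as I understand the Oozie specification: coord:latest is resolved when the action actually starts rather than when it is materialized, so rerunning or backfilling old dates may pick up instances that did not exist on the original day.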