hadoop, oozie

Exporting jobs listed in Oozie Web Console


Apologies if this question sounds basic; I'm totally new to the Hadoop environment.

What am I looking for?

In my case, there are jobs scheduled to run every day, and I want to export the list of failed jobs to an Excel sheet each day.

How do I view the workflow jobs?

Currently I use the Oozie web console to view the jobs, and I don't see an option to export. Also, I was not able to find this information in the Oozie documentation.

However, I found that jobs can be listed using commands like

$ oozie jobs -oozie http://localhost:8080/oozie -localtime -len 2 -filter status=RUNNING

Where am I stuck?

I want to filter the failed jobs for a given date and export them as CSV/Excel data.


Solution

  • @YoungHobbit was right to point at that post which is very similar to this one; his answer was dead on target when it comes to extracting the entire list of jobs that have run on a specific day with the Oozie CLI (command-line interface).
    Just don't forget to specify an "unbounded" reply size, e.g. -len 999999999, to avoid side effects (the default is to show only the first 100 matches, which may be far too low if you run a lot of frequent jobs).

    The trick is that you can make a more complex filter such as
      "startcreatedtime=2016-06-28T00:00Z;endcreatedtime=2016-06-28T10:00Z;status=FAILED"
    ... but you cannot request jobs that have FAILED or have been KILLED or have been SUSPENDED (which may result from a temporary YARN or HDFS outage) or are still suspiciously RUNNING (because a sub-workflow is SUSPENDED for instance).
    So your best choice is to get the whole list, then filter out all jobs that have SUCCEEDED, with a plain old grep -- as suggested in another answer.

    Then you will also need a complex sed or awk script to break down the ugly CLI output into a well-formed CSV. Ouch!
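
As a rough illustration of the CLI route, here is a minimal Python sketch that builds the `oozie jobs` command for one calendar day and then drops the SUCCEEDED lines, which is the `grep -v` step. The function names are hypothetical, the filter names use the lowercase spelling from the Oozie documentation, and the day is assumed to be expressed in UTC. It does not attempt the CSV conversion, and header or separator lines in the CLI output pass through untouched.

```python
import datetime
import subprocess


def build_jobs_command(oozie_url, day):
    """Return the argument list for `oozie jobs` covering one calendar day (UTC assumed)."""
    start = day.strftime("%Y-%m-%dT00:00Z")
    end = (day + datetime.timedelta(days=1)).strftime("%Y-%m-%dT00:00Z")
    job_filter = f"startcreatedtime={start};endcreatedtime={end}"
    # -len is set very high so the default page size does not truncate the list.
    return ["oozie", "jobs", "-oozie", oozie_url,
            "-len", "999999999", "-filter", job_filter]


def drop_succeeded(lines):
    """Keep every line that does not mention SUCCEEDED (equivalent to `grep -v`)."""
    return [line for line in lines if "SUCCEEDED" not in line]


def list_suspicious_jobs(oozie_url, day):
    """Run the Oozie CLI and return the raw output lines for non-successful jobs."""
    result = subprocess.run(build_jobs_command(oozie_url, day),
                            capture_output=True, text=True, check=True)
    return drop_succeeded(result.stdout.splitlines())
```

Calling `list_suspicious_jobs("http://localhost:8080/oozie", datetime.date(2016, 6, 28))` would require the `oozie` client on the PATH and a reachable server; the result is still raw text that needs further parsing.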


    Now, you have an alternative to the Oozie CLI: the Oozie REST API (old Cloudera tutorial here, reference for Oozie V4.2 here) lets you query the Oozie server with any programming language that provides...

    • an HTTP client
    • and a way to parse JSON messages (using plain old regular expressions, if nothing else is available)

    The logic would be basically the same -- fetch the list of all jobs in the desired time window, ignore SUCCEEDED jobs, parse the others to generate a CSV record, dump into a CSV file.
    But your program would be more robust, since it would be based on structured JSON input.
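
The REST logic described above could be sketched in Python roughly as follows, using only the standard library. Several things here are assumptions rather than facts from this answer: the `/v1/jobs` endpoint with `jobtype`, `len` and `filter` parameters, the `workflows` key in the reply, the JSON field names listed in `CSV_COLUMNS`, and a server that needs no authentication. Check them against the REST reference for your Oozie version. All function names are hypothetical.

```python
import csv
import io
import json
import urllib.parse
import urllib.request

# Assumed field names of a workflow job in the JSON reply.
CSV_COLUMNS = ["id", "appName", "status", "user", "startTime", "endTime"]


def build_jobs_url(oozie_url, start, end, max_jobs=999999999):
    """Build the jobs query URL for workflow jobs created between start and end."""
    job_filter = f"startcreatedtime={start};endcreatedtime={end}"
    query = urllib.parse.urlencode(
        {"jobtype": "wf", "len": max_jobs, "filter": job_filter})
    return f"{oozie_url}/v1/jobs?{query}"


def jobs_to_csv(payload):
    """Turn the parsed JSON reply into CSV text, skipping SUCCEEDED jobs."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=CSV_COLUMNS, extrasaction="ignore")
    writer.writeheader()
    for job in payload.get("workflows", []):
        if job.get("status") != "SUCCEEDED":
            writer.writerow({col: job.get(col, "") for col in CSV_COLUMNS})
    return out.getvalue()


def export_report(oozie_url, start, end, csv_path):
    """Fetch the job list from the server and dump the filtered report to a file."""
    with urllib.request.urlopen(build_jobs_url(oozie_url, start, end)) as response:
        payload = json.load(response)
    with open(csv_path, "w", newline="") as handle:
        handle.write(jobs_to_csv(payload))
```

For example, `export_report("http://localhost:8080/oozie", "2016-06-28T00:00Z", "2016-06-28T10:00Z", "failed_jobs.csv")` would write one row per job that did not succeed in that window. Keeping the URL building and the CSV conversion in separate functions makes each piece easy to check without a live server.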

    One more thing: if you are familiar with Microsoft VBA, you can even use an Excel macro to build the report dynamically, in a self-service way. No need to bother with an intermediate CSV file.