Apologies if this question sounds basic, I'm totally new to Hadoop environment.
What am I looking for?
In my case, there are jobs scheduled to run everday and I would want to export the list of failed jobs in an excel sheet each day.
How do I view the workflow jobs?
Currently I use the Oozie web console to view the jobs and I don't have/see an option to export. Also, I was not able to find this information from the Oozie documentation.
However, I found that jobs can be listed using commands like
$ oozie jobs -oozie http://localhost:8080/oozie -localtime -len 2 -fliter status=RUNNING
Where am I stuck?
I want to filter the failed jobs for a given date and would want to export it as csv/excel data.
@YoungHobbit was right to point at that post which is very similar to this one; his answer was dead on target when it comes to extracting the entire list of jobs that have run on a specific day with the Oozie CLI (command-line interface).
Just don't forget to specify an "unbounded" reply e.g. -len 999999999
to avoid side effects (defaut is to show only the first 100 matches, which may be way too low if you run a lot of frequent jobs).
The trick is that you can make a more complex filter such as
"startCreatedTime=2016-06-28T00:00Z;endcreatedtime=2016-06-28T10:00Z;status=FAILED"
... but you cannot request jobs that have FAILED or have been KILLED or have been SUSPENDED (which may result from a temporary YARN or HDFS outage) or are still suspiciously RUNNING (because a sub-workflow is SUSPENDED for instance).
So your best choice is to get the whole list, then filter out all jobs that have SUCCEEDED, with a plain old grep
-- as suggested in another answer.
Then you will also need a complex sed
or awk
script to break down the ugly CLI output into a well-formed CSV. Ouch!
The logic would be basically the same -- fetch the list of all jobs in the desired time window, ignore SUCCEEDED jobs, parse the others to generate a CSV record, dump into a CSV file.
But your program would be more robust, since it would be based on structured JSON input.
One more thing: if you are familiar with Microsoft VBA, you can even use an Excel macro to build the report dynamically, in a self-service way. No need to bother with in intermediate CSV file.