
How to prevent Hue in CDH from clearing job history on restart?


I have installed CDH 5.5.1 with Hue, Hadoop, Spark, Hive, Oozie, YARN and ZooKeeper.

When I run a Spark job or a MapReduce job, Hue displays it in the job history. The problem is that when I restart the CDH services (not the physical nodes), all job history from before the restart is removed.

[Screenshot: Hue Job Browser]

On HDFS there are several directories that I suspect hold the job information. Their paths are:

  • /tmp/logs/user/logs/
  • /user/history/done/2016/

I have looked for a relevant setting in the Cloudera Manager configuration page, the Hue configuration page and some configuration files, with no success. I don't know how to prevent this removal. Am I missing something?


Solution

  • If you just need to see job history on a Hadoop cluster, the YARN History Server (the JobHistory Server role in CDH) should have a history of the YARN jobs run on the cluster.

    Hue has a JIRA ticket for the issue you describe, titled "Job browser should talk to the YARN history server to display old jobs": https://issues.cloudera.org/browse/HUE-2558. Basically, Hue needs to talk to the YARN History Server (not just the Resource Manager) to get the information you're looking for.

    The good news is that the ticket appears to have been resolved and shipped with Hue 4.0, which was released on 5/11/2017. The bad news is that, as of this writing, Cloudera has not yet shipped a release that includes that version of Hue.
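
Until Hue can do this itself, the history server can be queried directly. The following is a minimal Python sketch, not a definitive implementation. It assumes the MapReduce JobHistory Server REST endpoint /ws/v1/history/mapreduce/jobs and the stock default web port 19888; check mapreduce.jobhistory.webapp.address on your cluster. The function names are hypothetical helpers for illustration only.

```python
import json
from urllib.request import urlopen


def history_jobs_url(host, port=19888):
    """Build the JobHistory Server REST URL that lists finished jobs.

    Assumption: 19888 is the default web port; your cluster may differ.
    """
    return f"http://{host}:{port}/ws/v1/history/mapreduce/jobs"


def parse_jobs(payload):
    """Return (id, name, state) tuples from a JSON response body.

    The server wraps the list as {"jobs": {"job": [...]}}; when there
    are no jobs the "jobs" value may be null, so both cases are handled.
    """
    doc = json.loads(payload)
    jobs = (doc.get("jobs") or {}).get("job") or []
    return [(j.get("id"), j.get("name"), j.get("state")) for j in jobs]


def list_finished_jobs(host, port=19888, timeout=10):
    """Fetch and parse the job list from a running JobHistory Server."""
    with urlopen(history_jobs_url(host, port), timeout=timeout) as resp:
        return parse_jobs(resp.read().decode("utf-8"))
```

Calling list_finished_jobs with your own history server host name should return the jobs recorded there, independently of what the Hue Job Browser currently shows. Note that this endpoint covers MapReduce jobs only; Spark applications are recorded separately by the Spark History Server.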