First, a little background: I have a test CDH cluster with two nodes. I am trying to execute an Oozie job that downloads a file, processes it with Spark, and then indexes it in Solr.
The cluster is configured to use Kerberos authentication. The CDH version is 5.7.1.
When I try to run the job with Oozie, using the following command:
oozie job --oozie https://host:11443/oozie/ -run --config oozieExample/job.properties
It fails with the following exception:
2016-08-12 12:29:40,415 WARN org.apache.oozie.action.hadoop.JavaActionExecutor: SERVER[it4364-cdh01.novalocal] USER[centos] GROUP[-] TOKEN[] APP[stackOverflow] JOB[0000012-160808110839555-oozie-clou-W] ACTION[0000012-160808110839555-oozie-clou-W@Download_Current_Data] Exception in check(). Message[JA017: Could not lookup launched hadoop Job ID [job_1470672690566_0027] which was associated with action [0000012-160808110839555-oozie-clou-W@Download_Current_Data]. Failing this action!]
org.apache.oozie.action.ActionExecutorException: JA017: Could not lookup launched hadoop Job ID [job_1470672690566_0027] which was associated with action [0000012-160808110839555-oozie-clou-W@Download_Current_Data]. Failing this action!
at org.apache.oozie.action.hadoop.JavaActionExecutor.check(JavaActionExecutor.java:1277)
at org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXCommand.java:182)
at org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXCommand.java:56)
at org.apache.oozie.command.XCommand.call(XCommand.java:286)
at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
A quick Google search suggests that this happens when the Job History Server is not running, or cannot detect the intermediate directory for the jobs.
When executing an ls command on the history directory, I get the following:
[hdfs@it4364-cdh01 ~]$ hadoop fs -ls /user/history
Found 2 items
drwxrwx--- - mapred hadoop 0 2016-08-12 10:36 /user/history/done
drwxrwxrwt - mapred hadoop 0 2016-08-12 12:29 /user/history/done_intermediate
Which is OK, I guess. In theory, the mapred user should be the owner of the history folder, according to the CDH documentation.
However, when I check the contents of done_intermediate:
[hdfs@it4364-cdh01 ~]$ hadoop fs -ls /user/history/done_intermediate
Found 1 items
drwxrwx--- - centos hadoop 0 2016-08-12 12:29 /user/history/done_intermediate/centos
Which means that the user centos (the one executing the Oozie job) is the owner of this directory. This prevents the Job History Server from reading the files to mark the job as completed, so Oozie marks it as failed. The logs state exactly this:
<omitted for brevity>
...
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=mapred, access=READ_EXECUTE, inode="/user/history/done_intermediate/centos":centos:hadoop:drwxrwx---
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:281)
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:262)
...
<omitted for brevity>
If I change the ownership of all the content in the history folder, e.g. with hadoop fs -chown -R mapred /user/history,
the history server recognises the job and marks it as completed.
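For reference, the manual workaround described above can be sketched as the following commands. This is a sketch under assumptions, not a verified fix: the path and the mapred:hadoop owner/group are taken from the listings earlier in the question, and the commands must run as an HDFS superuser (on a Kerberized cluster, after a kinit with the hdfs keytab):

```shell
# Hand the stuck intermediate history files back to the Job History Server user.
hadoop fs -chown -R mapred:hadoop /user/history/done_intermediate

# done_intermediate is expected to be world-writable with the sticky bit set
# (drwxrwxrwt), so each submitting user can create their own sub-directory.
hadoop fs -chmod 1777 /user/history/done_intermediate
```

Note that this only clears jobs that are already stuck; new jobs will recreate per-user sub-directories and can hit the same problem again until the underlying cause is fixed.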
I tried to run the job as the mapred user, changing the .properties file for the job; however, this also fails, now because the mapred user does not have permission to write to the /user
folder in HDFS, so it seems that this is not the correct solution.
Is there some configuration to avoid the ownership conflict between centos and mapred in the history folder?
Thanks in advance
Long story short: this specific HDFS permission issue for job history log collection may have different root causes:

1. mapred cannot be resolved by the "Group Mapping" rules
2. mapred can be resolved, but is not a member of the required hadoop system group (...)
3. the hdfs:///user/history/ sub-directories get messed up for some unknown reason -- e.g. the "sticky bit" switches from t to T without notice

A similar issue is described in this post: historyserver not able to read log after enabling kerberos (diagnosed as cause #2)
PS: I mentioned the "sticky bit" flip (cause #3) out of personal experience. Still puzzled about what caused the change, by the way.
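As a rough starting point, the causes listed above can be checked with standard Hadoop CLI commands; the paths below match the question's cluster layout, so adjust them to yours:

```shell
# Causes #1/#2: verify that the cluster can resolve mapred's groups at all,
# and that mapred is actually a member of the hadoop group.
hdfs groups mapred

# Cause #3: inspect the mode of the intermediate directory. A lowercase 't'
# (drwxrwxrwt) means sticky bit plus execute-for-others; an uppercase 'T'
# means the execute bit for "others" was lost, which blocks other users
# from traversing the directory.
hadoop fs -ls /user/history

# Restore the expected mode if the sticky bit got flipped:
hadoop fs -chmod 1777 /user/history/done_intermediate
```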