First, a little background: I have a test CDH cluster with two nodes. I am trying to execute an Oozie job that downloads a file, processes it with Spark, and then indexes it in Solr.
The cluster is configured to use Kerberos authentication. The CDH version is 5.7.1.
When I try to run the job with Oozie, using the following command:
oozie job --oozie https://host:11443/oozie/ -run --config oozieExample/job.properties
It fails with the following exception:
2016-08-12 12:29:40,415 WARN org.apache.oozie.action.hadoop.JavaActionExecutor: SERVER[it4364-cdh01.novalocal] USER[centos] GROUP[-] TOKEN[] APP[stackOverflow] JOB[0000012-160808110839555-oozie-clou-W] ACTION[0000012-160808110839555-oozie-clou-W@Download_Current_Data] Exception in check(). Message[JA017: Could not lookup launched hadoop Job ID [job_1470672690566_0027] which was associated with action [0000012-160808110839555-oozie-clou-W@Download_Current_Data]. Failing this action!]
org.apache.oozie.action.ActionExecutorException: JA017: Could not lookup launched hadoop Job ID [job_1470672690566_0027] which was associated with action [0000012-160808110839555-oozie-clou-W@Download_Current_Data]. Failing this action!
at org.apache.oozie.action.hadoop.JavaActionExecutor.check(JavaActionExecutor.java:1277)
at org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXCommand.java:182)
at org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXCommand.java:56)
at org.apache.oozie.command.XCommand.call(XCommand.java:286)
at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
A quick Google search suggests that this happens when the Job History Server is not running, or cannot detect the intermediate directory for the jobs.
When executing an ls command on the history directory, I get the following:
[hdfs@it4364-cdh01 ~]$ hadoop fs -ls /user/history
Found 2 items
drwxrwx--- - mapred hadoop 0 2016-08-12 10:36 /user/history/done
drwxrwxrwt - mapred hadoop 0 2016-08-12 12:29 /user/history/done_intermediate
Which is OK, I guess. In theory, the mapred user should be the owner of the history folder, according to the CDH documentation.
However, when I check the contents of done_intermediate:
[hdfs@it4364-cdh01 ~]$ hadoop fs -ls /user/history/done_intermediate
Found 1 items
drwxrwx--- - centos hadoop 0 2016-08-12 12:29 /user/history/done_intermediate/centos
Which means that the user centos (the one executing the Oozie job) is the owner of this directory. This prevents the Job History Server from reading the files to mark the job as completed, so Oozie marks it as failed. The logs state exactly this:
<omitted for brevity>
...
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=mapred, access=READ_EXECUTE, inode="/user/history/done_intermediate/centos":centos:hadoop:drwxrwx---
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:281)
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:262)
...
<omitted for brevity>
If I change the ownership of all the content in the history folder, e.g. with hadoop fs -chown -R mapred /user/history,
the history server recognises the job and marks it as completed.
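For reference, the manual workaround described above can be sketched as the following commands. This is a sketch under assumptions, not a verified fix: the path and the mapred:hadoop owner/group are taken from the listings earlier in the question, and the commands must run as an HDFS superuser (on a Kerberized cluster, after a kinit with the hdfs keytab):

```shell
# Hand the stuck intermediate history files back to the Job History Server user.
hadoop fs -chown -R mapred:hadoop /user/history/done_intermediate

# done_intermediate is expected to be world-writable with the sticky bit set
# (drwxrwxrwt), so each submitting user can create their own sub-directory.
hadoop fs -chmod 1777 /user/history/done_intermediate
```

Note that this only clears jobs that are already stuck; new jobs will recreate per-user sub-directories and can hit the same problem again until the underlying cause is fixed.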
I tried to run the job as the mapred user, changing the .properties file for the job; however, this also fails, now because the mapred user does not have permission to write to the /user
folder in HDFS, so it seems that this is not the correct solution.
Is there some configuration to avoid the ownership conflict between centos and mapred in the history folder?
Thanks in advance
Long story short: this specific HDFS permission issue for job history log collection may have different root causes:

1. mapred cannot be resolved by the "Group Mapping" rules
2. mapred can be resolved, but is not a member of the required hadoop system group (...)
3. the hdfs:///user/history/ sub-directories get messed up for some unknown reason -- e.g. the "sticky bit" switches from t to T without notice

A similar issue is described in this post: historyserver not able to read log after enabling kerberos (diagnosed as cause #2)
PS: I mentioned the "sticky bit" flip (cause #3) out of personal experience. Still puzzled about what caused the change, by the way.
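As a rough starting point, the causes listed above can be checked with standard Hadoop CLI commands; the paths below match the question's cluster layout, so adjust them to yours:

```shell
# Causes #1/#2: verify that the cluster can resolve mapred's groups at all,
# and that mapred is actually a member of the hadoop group.
hdfs groups mapred

# Cause #3: inspect the mode of the intermediate directory. A lowercase 't'
# (drwxrwxrwt) means sticky bit plus execute-for-others; an uppercase 'T'
# means the execute bit for "others" was lost, which blocks other users
# from traversing the directory.
hadoop fs -ls /user/history

# Restore the expected mode if the sticky bit got flipped:
hadoop fs -chmod 1777 /user/history/done_intermediate
```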