I have a Spring boot application which is accessing HDFS through Webhdfs secured via Apache Knox secured by Kerberos. I created my own KnoxWebHdfsFileSystem
with custom scheme (swebhdfsknox) as a subclass of WebHdfsFilesystem
which only changes the URLs to contain the Knox proxy prefix. So it effectively remaps requests from form:
http://host:port/webhdfs/v1/...
to the Knox one:
http://host:port/gateway/default/webhdfs/v1/...
I do this by overriding two methods:
public URI getUri()
URL toUrl(Op op, Path fspath, Param<?, ?>... parameters)
So far so good. I let spring boot create FsShell
for me and use it for various operations such as list files, mkdir etc. All work fine. Except copyFromLocal which as documented requires 2 steps and redirect. And on the last step when the filesystem tries to PUT
to the final URL which received in Location header it fails with error:
org.apache.hadoop.security.AccessControlException: Authentication required
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:334) ~[hadoop-hdfs-2.6.0.jar:na]
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:91) ~[hadoop-hdfs-2.6.0.jar:na]
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$FsPathOutputStreamRunner$1.close(WebHdfsFileSystem.java:787) ~[hadoop-hdfs-2.6.0.jar:na]
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:54) ~[hadoop-common-2.6.0.jar:na]
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112) ~[hadoop-common-2.6.0.jar:na]
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366) ~[hadoop-common-2.6.0.jar:na]
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:338) ~[hadoop-common-2.6.0.jar:na]
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:302) ~[hadoop-common-2.6.0.jar:na]
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1889) ~[hadoop-common-2.6.0.jar:na]
at org.springframework.data.hadoop.fs.FsShell.copyFromLocal(FsShell.java:265) ~[spring-data-hadoop-core-2.2.0.RELEASE.jar:2.2.0.RELEASE]
at org.springframework.data.hadoop.fs.FsShell.copyFromLocal(FsShell.java:254) ~[spring-data-hadoop-core-2.2.0.RELEASE.jar:2.2.0.RELEASE]
I suspect the problem is the redirect somehow but can't figure out what might be the problem here. If I do the same requests via curl the file is successfully uploaded to HDFS.
This is a known issue with using existing Hadoop clients against Apache Knox using the HadoopAuth provider for kerberos on Knox. If you were to use curl or some other REST client it would likely work for you. The existing Hadoop java client doesn't expect a SPNEGO challenge from the DataNode - which is what the PUT in the send step is talking to. The DataNode expects the block access token/delegation token issued by the NameNode in the first step to be present. The Knox gateway however will require SPNEGO authentication for every request to that topology.
This is an issue that is on the roadmap to be addressed and will likely become hotter with interest moving more inside the cluster rather than only accessing resources through it from the outside.
The following JIRA tracks this item and as you can see from the title is related to DistCp which is a similar usecase: https://issues.apache.org/jira/browse/KNOX-482
Feel free to take a look and lend a hand with testing or developing - it would all be most welcome!
Another possibility would be to change the Hadoop java client to deal with a SPNEGO challenge for the DataNode as well.