Understanding the difference between hftp URLs and hdfs paths

I ran into this difference over the weekend, when I was trying to transfer bulk data between clusters (physically located in separate rooms) over hftp by running

hadoop distcp hftp-path-src hftp-path-dst

where an hftp URL is something like hftp://node:50070/more/path

It failed midway, on some of the files. The logs said

Unhandled internal error. Vertex failed, vertexName=scope-152 ...

I checked those files manually and didn't find anything suspicious. I also tried the following foolish Pig script to see if it could surprise me

data = LOAD '$src_hftp' USING PigStorage('\t', '-schema');
STORE data INTO '$dst_hftp' USING PigStorage('\t', '-schema');

That failed miserably with the message

"...DAG did not succeed due to VERTEX_FAILURE"

Now how about

hadoop distcp hdfs-path-src hdfs-path-dst

where an hdfs-path is something like hdfs://namenode:8020/more/path?

It worked fine. What? Why?

Many many thanks in advance.


In response to @rahulbmv's answer, I did try

hadoop distcp hftp-path-src hdfs-path-dst

which also failed midway: I could see some of the transferred files on the dst HDFS, while others were missing. So I thought the destination protocol was irrelevant. The reference I referred to was

I also tried logging into dst namenode server and doing

hadoop distcp hftp-path-src normal-path-without-hdfs-or-hftp

The same error happened.

But yes, the writing side should use the hdfs protocol. Even with an hdfs destination, the error persisted. As @rahulbmv pointed out, the only real difference was the protocol the reader used. I will go back and dig up the error messages later today.


  • hftp is a read-only file system, so you cannot copy into an hftp destination. That said, you should still be able to run hadoop distcp hftp-path-src hdfs-path-dst. You can read more about hftp and the operations that it supports here.
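
To make the read-only point concrete, here is a minimal shell sketch that checks the destination scheme before building a distcp command. The helper names uri_scheme and distcp_cmd are hypothetical, the hostnames src-nn and dst-nn are placeholders, and the function only prints the command rather than running it. It does not address the mid-copy failure described in the question.

```shell
# Sketch only: uri_scheme and distcp_cmd are hypothetical helpers,
# and src-nn / dst-nn are placeholder hostnames.

# Print the URI scheme of a path (prints an empty line if there is none).
uri_scheme() {
  case "$1" in
    *://*) printf '%s\n' "${1%%://*}" ;;
    *)     printf '\n' ;;
  esac
}

# Print a distcp command line for src -> dst.
# Refuses an hftp destination, since hftp is read-only.
distcp_cmd() {
  if [ "$(uri_scheme "$2")" = "hftp" ]; then
    echo "error: hftp is read-only, cannot write to $2" >&2
    return 1
  fi
  printf 'hadoop distcp %s %s\n' "$1" "$2"
}
```

For example, distcp_cmd hftp://src-nn:50070/more/path hdfs://dst-nn:8020/more/path prints the command to run, while swapping the two arguments is rejected with an error on stderr.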