I wish to know how to move data from an EMR cluster's HDFS file system to an S3 bucket. I recognize that I can write directly to S3 in Spark, but in principle it should also be straightforward to do it afterwards, and so far I have not found that to be true in practice.
AWS documentation recommends s3-dist-cp for moving data between HDFS and S3. The documentation for s3-dist-cp states that the HDFS source should be specified in URL format, i.e., hdfs://path/to/file. So far I have moved data between HDFS and my local file system using hadoop fs -get, which takes paths of the form path/to/file rather than hdfs://path/to/file. It is unclear how to map between the two.
I am working from an SSH session on the master node. I tried the following, each with both two and three slashes:
hdfs:///[public IP]/path/to/file
hdfs:///[public IP]:8020/path/to/file
hdfs:///localhost/path/to/file
hdfs:///path/to/file
/path/to/file
(and many variants). In each case, my command is formatted as per the documentation:
s3-dist-cp --src hdfs://... --dest s3://my-bucket/destination
I have tried with both individual files and whole directories. In each case, I get an error that the source file does not exist. What am I doing wrong?
Relative and non-fully-qualified paths are automatically resolved to fully qualified paths based on the default file system (configured as fs.defaultFS in core-site.xml, which defaults to HDFS on EMR) and the current working directory, which defaults to the user's home directory, /user/<username>.
On EMR, an absolute path like /path/to/file is equivalent to hdfs:///path/to/file. A relative path like path/to/file resolves to hdfs:///user/hadoop/path/to/file (assuming you are running a command as the hadoop user).
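For example, assuming you are logged in as the hadoop user and your data is at /user/hadoop/path/to/file in HDFS (path/to/file is just a placeholder here), the following three commands all refer to the same location:

hadoop fs -ls path/to/file
hadoop fs -ls /user/hadoop/path/to/file
hadoop fs -ls hdfs:///user/hadoop/path/to/file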
The reason you are encountering a "file not found" error with your hdfs:// paths is that, in most of your examples, the hostname ends up in the wrong place: there are too many slashes before it. If you include a hostname, there should be exactly two slashes before it. You don't actually need to include a hostname at all, though, so hdfs:///path/to/file also works; three slashes in a row means the default hostname will be used. Since most of your examples had three slashes followed by a hostname, the hostname was interpreted as the first component of the path rather than as a hostname at all.
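So, assuming your data is at /path/to/file in HDFS, either of the following forms should work; the second is shown with a placeholder private DNS name for the master node and the default NameNode port:

s3-dist-cp --src hdfs:///path/to/file --dest s3://my-bucket/destination
s3-dist-cp --src hdfs://ip-xx-xx-xx-xx.ec2.internal:8020/path/to/file --dest s3://my-bucket/destination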
Your fourth example (hdfs:///path/to/file) is actually a valid path, but it does not refer to the same thing as path/to/file, which is a relative path. As mentioned above, /path/to/file is equivalent to hdfs:///path/to/file, while path/to/file is equivalent to hdfs:///user/hadoop/path/to/file.
By the way, if you do use a hostname, I'm pretty sure it needs to be the master node's private hostname, not its public IP. (Though, again, you can leave the hostname off altogether and use three slashes in a row to indicate that no hostname is given.) I would recommend against including the hostname, because then you would need to change the path any time you ran the command on a different cluster.
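If you want to see what the default file system is actually set to on your cluster, you can check it from the master node; this just reads fs.defaultFS from the cluster configuration:

hdfs getconf -confKey fs.defaultFS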
Lastly, it's not exactly true that hadoop fs -get only takes non-URI-style paths and s3-dist-cp only takes URI-style paths; both commands accept either style. "hadoop fs -get /path/to/file" and "hadoop fs -get hdfs:///path/to/file" are both valid and equivalent.
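To illustrate with your original command, both of these forms of the source path should be accepted by s3-dist-cp (bucket and destination names taken from your question):

s3-dist-cp --src /path/to/file --dest s3://my-bucket/destination
s3-dist-cp --src hdfs:///path/to/file --dest s3://my-bucket/destination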