
AWS EMR Bootstrap action "aws s3 cp ..." to download 11GB file failing due to [Errno 28] No space left on device


Using the AWS console, I am trying to start an EMR cluster (including HBase and ZooKeeper) with a startup script that downloads 11 GB of data from S3 and then puts that data into HDFS. I have a shell script that includes the lines

aws s3 cp s3://path/to/eleven/gb/of/data local/ --recursive
hdfs dfs -put local/ /

The script is on S3, and when I start the cluster I include a bootstrap action pointing to it.

However, the cluster fails to launch and gives this error:

Terminated with errors: On the master instance (i-036fb1c03d99115a8), bootstrap action 1 returned a non-zero return code

When I go to the logs, I see this in the stderr output

download failed: s3://path/to/eleven/gb/of/data/d/95d969cadfa644de8d1b2793e0df7822 to local/d/95d969cadfa644de8d1b2793e0df7822 [Errno 28] No space left on device

And the last line of the stdout output is

Completed 5.1 GiB/11.0 GiB (49.5 MiB/s) with 1 file(s) remaining

In the cluster configuration I set the root device EBS volume size to 100 GB for each node, so I'm not quite sure why the device runs out of space after downloading only 5.1 GB of data.


Solution

EMR bootstrap actions run as the hadoop user, with /home/hadoop as the working directory. The home directory's partition doesn't get much storage, and with your configuration it looks like it is limited to about 5.1 GB.
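
A quick way to confirm where the space actually is: run df from the bootstrap script (or over SSH) and compare the mount points. On EMR the large EBS volumes are typically mounted under /mnt, /mnt1, and so on rather than under /home, though the exact layout varies by AMI and instance type, so treat the paths below as illustrative:

df -h                     # show free space per mount point
df -h /home/hadoop /mnt   # compare the home partition with the big volume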


You can put the file somewhere other than the home directory (e.g. /etc/temp) before moving it to HDFS.
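
A minimal sketch of that adjustment, assuming /mnt/tmp sits on one of the larger volumes (that path is an assumption; substitute whatever df -h shows has room on your cluster):

# Download to a roomy mount point instead of the hadoop home directory,
# then push to HDFS and clean up. /mnt/tmp is an assumed location.
sudo mkdir -p /mnt/tmp && sudo chown hadoop:hadoop /mnt/tmp
aws s3 cp s3://path/to/eleven/gb/of/data /mnt/tmp/data --recursive
hdfs dfs -put /mnt/tmp/data /
rm -rf /mnt/tmp/data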

A better way is to use a one-step process instead of two, i.e. copy directly from S3 to HDFS using s3-dist-cp; the EMR documentation has more details. I think this is the best solution without any configuration change, since EMR comes with s3-dist-cp pre-installed, and it also saves time by copying files in parallel.
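
One thing to keep in mind: s3-dist-cp needs HDFS running, so it is normally run after the cluster is up (or submitted as an EMR step) rather than inside a bootstrap action. A sketch of both forms, where the cluster ID j-XXXXXXXXXXXXX and the hdfs:///data destination are placeholders:

# Run directly on the master node once the cluster is up:
s3-dist-cp --src s3://path/to/eleven/gb/of/data --dest hdfs:///data

# Or submit it as a step from the AWS CLI:
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  'Type=CUSTOM_JAR,Name=S3toHDFS,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[s3-dist-cp,--src,s3://path/to/eleven/gb/of/data,--dest,hdfs:///data]'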


There is another way: instead of HDFS you can use EMRFS. With EMRFS you don't have to download the data at all, and while it comes at a higher cost than regular S3, it has plenty of advantages as well. The EMRFS section of the EMR documentation is a good place to start.
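
For a sense of what that looks like in practice: with EMRFS, the standard Hadoop tooling on the cluster can address the S3 data in place by using an s3:// URI wherever an hdfs:// path would go, so there is no download step at all:

# List and read the dataset in place -- nothing is copied to HDFS.
hadoop fs -ls s3://path/to/eleven/gb/of/data/
hadoop fs -cat s3://path/to/eleven/gb/of/data/d/95d969cadfa644de8d1b2793e0df7822 | head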