amazon-web-services, amazon-s3, aws-cli

AWS S3 upload via CLI fails after some time with exception "Could not connect to the endpoint URL"


I am trying to upload a large file (~2.3 GB) to an S3 bucket. The transfer starts but fails abruptly after some time. The first time I tried, the upload completed successfully, so the command itself appears to work.

My command: aws s3 cp local\path\to\file s3://bucket/remotepath

This is what it looks like in progress for some time:

Completed 136.8 MiB/2.3 GiB (542.4 KiB/s) with 1 file(s) remaining

The upload starts and fails after some time with the exception:

upload failed: local\path\to\file to s3://bucket/remotepath Could not connect to the endpoint URL: "https://bucket.s3.us-east-1.amazonaws.com/remotepath?uploadId=someUploadId"

Credentials seem fine:

aws configure
AWS Access Key ID [****************XXXX]:
AWS Secret Access Key [****************XXXX]:
Default region name [us-east-1]:
Default output format [json]:

Internet connectivity is also consistent.

nslookup s3.amazonaws.com
Server:  modem.Home
Address:  192.168.0.1
Non-authoritative answer:
Name:    s3-1.amazonaws.com
Address:  52.X.X.X
Aliases:  s3.amazonaws.com

ping s3.amazonaws.com
Ping statistics for 52.X.X.X:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 77ms, Maximum = 84ms, Average = 80ms

Two questions:

  1. How can I debug and find the reason for failure?
  2. What can I do to make sure it works reliably?

Solution

    Even with a stable and reasonably fast Internet connection, large file uploads to S3 via the aws cli can fail with this error if they max out the available upload bandwidth.

    It can be resolved by tweaking some values in your AWS CLI config file (~/.aws/config):

    • max_concurrent_requests - set this to a less ambitious number than the default 10 (I used 4).
    • max_bandwidth - reduce this to a number slightly less than the upload speed aws s3 reports when using defaults (in my case that was 1.2MB/s, so I set this value to 1MB/s).
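Assuming the profile otherwise matches the `aws configure` output above, the resulting `~/.aws/config` might look like this (the two `s3` values are the ones that worked for my connection; tune them to yours):

```ini
[default]
region = us-east-1
output = json
s3 =
    # Fewer simultaneous part uploads (the default is 10)
    max_concurrent_requests = 4
    # Cap the total upload rate just below the connection's measured maximum
    max_bandwidth = 1MB/s
```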

    Reasoning

    I noticed that while an aws s3 upload was running, my Internet connection was unusable. Even loading a simple web page on a separate device would time out, as would a DNS lookup. This led me to suspect that aws s3 is so good at saturating upload bandwidth that it prevents outbound connections from completing successfully - including its own.

    Uploads via aws s3 are multi-part by default, meaning that files over a certain size (multipart_threshold) are broken into chunks which are uploaded separately and concurrently (up to max_concurrent_requests at a time). The combined bandwidth of these upload requests is capped at max_bandwidth.
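These multipart parameters live in the same `s3` section of the config file, if you want to tune when and how files are split. A sketch with illustrative values (not recommendations):

```ini
[default]
s3 =
    # Files larger than this are uploaded in parts (CLI default: 8MB)
    multipart_threshold = 64MB
    # Size of each uploaded part (CLI default: 8MB)
    multipart_chunksize = 16MB
```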

    I suspect that if max_bandwidth is >= the upload bandwidth of the Internet connection, eventually the connection becomes saturated and one of the new multi-part upload requests is unable to connect to S3, resulting in the Could not connect to the endpoint URL... error.

    Limiting max_bandwidth is likely the key factor here. Reducing it ensures that some bandwidth is free for other outbound requests to complete - not only aws s3's own concurrent upload requests, but also anything else trying to use the Internet connection. And if upload bandwidth is maxed out, there's really no need for a large number of concurrent connections; each new connection is a potential failure point, so reducing them via max_concurrent_requests makes sense too.

    Also note that you can use --debug to get verbose debug output from aws s3.
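For example, reusing the placeholder paths from the question (the --debug output is very verbose, so capturing it to a file makes it easier to search for the failing request):

```shell
# Same hypothetical paths as in the question; debug output goes to stderr
aws s3 cp local\path\to\file s3://bucket/remotepath --debug 2> upload-debug.log
```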