amazon-web-services amazon-s3 curl wget

Downloading a text file from a pre-signed S3 URL with wget returns binary data


I'm trying to programmatically download a file from a pre-signed S3 URL. I know that the file is ASCII text. When I download it by pasting the URL into Chrome, the file is as I would expect (see below). With wget, however, the downloaded file is binary.

Unfortunately, the previous posts I found about this did not help much. They suggest adding quotes around the URL, but my URL does not contain special characters. Some of the posts I checked: Amazon AWS S3 signed URL via Wget, https://superuser.com/questions/1311516/curl-can-not-download-file-but-browser-can. (I double-checked anyway with both double and single quotes; neither worked in my case.)

➜  wget --no-check-certificate --no-proxy  "https://s3.eu-central-1.amazonaws.com/.../text_file.txt"
--2022-07-28 10:49:57--  https://s3.eu-central-1.amazonaws.com/.../text_file.txt
Resolving s3.eu-central-1.amazonaws.com (s3.eu-central-1.amazonaws.com)... 52.219.75.159
Connecting to s3.eu-central-1.amazonaws.com (s3.eu-central-1.amazonaws.com)|52.219.75.159|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21110 (21K) [binary/octet-stream]
Saving to: ‘text_file.txt’

text_file.txt                                     100%[===========================================================================================================>]  20.62K  --.-KB/s    in 0.004s  

2022-07-28 10:49:57 (5.61 MB/s) - ‘text_file.txt’ saved [21110/21110]

➜  file text_file.txt                                                                                                                                           
text_file.txt: data
➜  cat text_file.txt | head -n 1
[78!???ÊBz?j????X?????x>??_uߩi??a?Qqax?W?ϴ??_c????H???u?c??}???U??5?M?|A?-9?H?Y??\?՟??B?l
2ɯL????:?JZF㽬???,2?gn????Y~vU?l4?O`?!???r                                               ?h?1?]??f???
                                          ?MIUM??_??q?u?dC???v?MbcI>?R??oV???&?
# Following lines are for a file downloaded by copy-paste of the URL to a Chrome window
➜  file text_file\ \(1\).txt 
text_file (1).txt: ASCII text
➜  cat text_file\ \(1\).txt| head -n 1 
# Header of file

Solution

  • The object is most likely stored compressed in S3. When a file is compressed with a common algorithm such as gzip, Brotli, LZW, or zlib and served with the matching Content-Encoding header, most browsers decompress it on the fly, whether for display or download. wget, by default, saves the bytes exactly as they were sent.

    For instance, if we upload a simple HTML file, but compress it:

    $ cat example_file.html | brotli | \
        aws s3 cp - s3://example-bucket/example_html_br.html \
        --acl=public-read --content-encoding br
    

    Then we can view the contents in the browser, because the browser engine decompresses the file:

    [Image: browser showing the HTML file]

    But downloading the file with wget shows the compressed contents:

    $ wget -qO- https://example-bucket.s3.amazonaws.com/example_html_br.html | hexdump -C
    00000000  1f 6e 00 00 1d 07 ee be  1d 1b 46 77 12 aa 15 78  |.n........Fw...x|
    00000010  a8 dc d4 d4 5b 83 cc a0  a5 81 96 1c b0 b7 d5 6d  |....[..........m|
    00000020  29 46 f6 fa 6e 63 eb 29  ea aa 82 c8 25 a8 42 91  |)F..nc.)....%.B.|
    00000030  ce 1d 07 f6 06 e1 52 0f  f4 4a a9 d6 87 17 76 ff  |......R..J....v.|
    00000040  e1 da 01                                          |...|
    

    You can verify this by looking at the HTTP headers:

    $ wget -S https://example-bucket.s3.amazonaws.com/example_html_br.html
    --2022-08-01 14:10:40--  https://example-bucket.s3.amazonaws.com/example_html_br.html
    Resolving example-bucket.s3.amazonaws.com (example-bucket.s3.amazonaws.com)... 52.218.178.75
      [...]
      HTTP/1.1 200 OK
      Content-Encoding: br
    

    The Content-Encoding: br header is what tells the browser to decompress the body. Either make sure that whatever component places this content in S3 doesn't compress it, or, if you want to download the content as it is stored, decompress it yourself as the browser does:

    $ wget -qO- https://example-bucket.s3.amazonaws.com/example_html_br.html | brotli -df
    <html>
    <head>
    <title>Example</title>
    [...]
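
If the encoding is not known in advance, the decompression step above can be chosen from the Content-Encoding value. The following is a minimal sketch, not a definitive implementation: the function name decode_body is hypothetical, only the gzip and br encodings are handled, and it assumes the gzip and brotli command-line tools are installed.

```shell
# decode_body ENCODING
# Read a response body on stdin and write the decoded bytes to stdout.
# ENCODING is the value of the Content-Encoding response header.
# (Hypothetical helper; anything other than gzip, br, or identity is rejected.)
decode_body() {
    case "$1" in
        gzip|x-gzip) gzip -dc ;;      # gzip-compressed body
        br)          brotli -dc ;;    # Brotli-compressed body
        ""|identity) cat ;;           # no encoding: pass the bytes through
        *)
            echo "unsupported Content-Encoding: $1" >&2
            return 1
            ;;
    esac
}
```

With the example object above, this would be used as `wget -qO- https://example-bucket.s3.amazonaws.com/example_html_br.html | decode_body br`.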
    

    The same premise holds true if you're using pre-signed URLs.
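
For a pre-signed URL, one way to check the encoding is to save the response headers of the actual GET request, since a URL signed for GET is typically rejected when it is used for a HEAD request. The sketch below parses such a header dump; the function name content_encoding and the file names headers.txt and body.bin are assumptions made for illustration.

```shell
# content_encoding
# Read raw HTTP response headers on stdin and print the lower-cased
# Content-Encoding value. Prints nothing if the header is absent.
# (Hypothetical helper name.)
content_encoding() {
    # Strip carriage returns, then match the header name case-insensitively.
    tr -d '\r' |
        awk -F': *' 'tolower($1) == "content-encoding" { print tolower($2); exit }'
}

# Usage sketch (needs network access, so it is left commented out).
# curl's -D option writes the response headers to a file, -o saves the body:
#   curl -s -D headers.txt -o body.bin "https://s3.eu-central-1.amazonaws.com/.../text_file.txt"
#   content_encoding < headers.txt
```

If this prints br, the saved body can be decoded with `brotli -dc < body.bin`; if it prints gzip, with `gzip -dc < body.bin`.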