Search code examples
curlbrowserutf-8jax-rsbyte-order-mark

Preserve UTF-8 BOM in Browser Downloads


I have a JAX-RS REST-Service that produces a CSV file and streams it back to the browser. Everything is set to UTF-8, so also the file I download via the browser is a valid UTF-8 File (without a BOM) that shows me valid, readable UTF-8 umlauts, etc. in Notepad++, Sublime, etc..

Opening such a file in Excel though leads to unreadable umlauts, etc. since Excel apparently tries to open it with another charset (CP-1252, I guess, but that doesn't really matter).

Saving the file with a BOM via Notepad++ and re-opening it in Excel works nicely. Seems like the detection of a BOM is the only way that Excel uses to detect UTF-8. Anyways - I thought that adding a BOM could help...

Did that. Same result. After a while, I figured out that the BOM gets removed under some circumstances: If I added any character right before the BOM, I could see the BOM in my Hex-Editor. After removal of that character, the BOM wouldn't be there anymore.

When I went on and downloaded the file via cURL I was really surprised. The BOM was there! Up until that I thought it might have to do with my application, Content-Types, Encodigs, HTTP Headers, etc. - but all of them seem to be fine.

Now, after hours of trying out different things, any ideas on how I can tell the browser to keep the BOM? Is there any HTTP Header I could set? Since Chrome, Internet Explorer, Edge, Firefox all remove the BOM, this sounds a little bit like a browser convention to me...

Many thanks for your highly appreciated help!

EDIT: Thanks to sideshowbarker answer, I found a workaround by prepending two BOMs to the content, so there will be a BOM remaining after the first BOM gets removed by the browser.


Solution

  • Workaround (from comments): Since only the first three bytes are read, you can prepend two BOMs to the source, which will result in the downloaded file being valid UTF-8 with a BOM.

    As far as Excel specifically: Per the answer at https://stackoverflow.com/a/16766198/1143392, newer versions of Excel (from Office 365) do now support UTF-8.

    As far as the cause of the behavior described in the question: The cause is, the relevant specs require the BOM to be stripped out, and that’s what browsers do. That is, browsers conform to the requirements of the UTF-8 decode algorithm in the Encoding spec, which is this:

    To UTF-8 decode a byte stream stream, run these steps:

    1. Let buffer be an empty byte sequence.

    2. Read three bytes from stream into buffer.

    3. If buffer does not match 0xEF 0xBB 0xBF, prepend buffer to stream.

    4. Let output be a code point stream.

    5. Run UTF-8’s decoder with stream and output.

    6. Return output.

    Step 3 is what causes the BOM to be stripped.

    Given the Encoding spec requires that, I think there’s no way to tell browsers to keep the BOM.