
Multi-byte Unicode character decoded incorrectly from data URI


I recently had a problem with html data URIs:

My source html included the character ā, which rendered correctly when the html was loaded directly. However, when the html was converted to a data URI, the character instead rendered as Ä.

After digging through the resulting URI, I found that the character had been encoded as %c4%81, but this seems to be the correct URI encoding of ā.

I even tried switching the data URI to base64, but I got the same issue. This happens in both Chrome and Safari.

I'm wondering if it is a problem with encoding multi-byte Unicode characters in data URIs, because ā is currently the only multi-byte character in my HTML.

console.log(encodeURIComponent('ā'));

// https://stackoverflow.com/questions/23223718/failed-to-execute-btoa-on-window-the-string-to-be-encoded-contains-characte
console.log(btoa(unescape(encodeURIComponent('ā'))));

<iframe src="data:text/html,%c4%81"></iframe>
<iframe src="data:text/html;base64,xIE="></iframe>


Solution

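  • You need to specify the character encoding when working with text data URIs. Here that is UTF-8, since %c4%81 is the UTF-8 byte sequence for ā.

    Without a charset parameter the browser does not assume UTF-8. It falls back to a default encoding (typically a single-byte one such as windows-1252), so the two bytes are decoded as two separate characters, which is where the Ä comes from. If you add a ;charset=UTF-8 parameter to the MIME type, the browser decodes the character correctly. In the base64 form, the charset parameter goes before ;base64.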

    <iframe src="data:text/html;charset=UTF-8,%c4%81"></iframe>
    <iframe src="data:text/html;charset=UTF-8;base64,xIE="></iframe>
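
If you are generating the data URI from script, the same fix can be applied when the URI is built. The following is a minimal sketch, not a definitive implementation: the helper names toHtmlDataUri and toHtmlDataUriBase64 are made up for illustration, and it assumes an environment where TextEncoder and btoa are available (current browsers and recent Node.js versions).

```javascript
// Hypothetical helpers (names are illustrative, not from any library).

// Percent-encoded form: encodeURIComponent emits the UTF-8 bytes of each
// character, so the URI must declare charset=UTF-8 to match.
function toHtmlDataUri(html) {
  return 'data:text/html;charset=UTF-8,' + encodeURIComponent(html);
}

// Base64 form: btoa only accepts code units 0-255, so the string is first
// converted to UTF-8 bytes and each byte is mapped to one character.
function toHtmlDataUriBase64(html) {
  const bytes = new TextEncoder().encode(html);
  let binary = '';
  for (const b of bytes) {
    binary += String.fromCharCode(b);
  }
  return 'data:text/html;charset=UTF-8;base64,' + btoa(binary);
}

console.log(toHtmlDataUri('ā'));       // data:text/html;charset=UTF-8,%C4%81
console.log(toHtmlDataUriBase64('ā')); // data:text/html;charset=UTF-8;base64,xIE=
```

The TextEncoder loop does the same job as the btoa(unescape(encodeURIComponent(...))) trick from the question, without relying on the deprecated unescape function.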