Tags: javascript, unicode, btoa

btoa behavior with multibyte characters


btoa expects an input string representing binary data, which it will then base64 encode.

However, if the input contains multibyte characters it will throw an error, because btoa only supports input characters within the Latin-1 range of Unicode (code points 0–255).

const ok = "a";
console.log(ok.codePointAt(0).toString(16)); //   61: occupies < 1 byte

const notOK = "✓";
console.log(notOK.codePointAt(0).toString(16)); // 2713: occupies > 1 byte

console.log(btoa(ok)); // YQ==
console.log(btoa(notOK)); // error

But why is this the case? Why couldn't btoa simply treat the input string as a sequence of bytes and encode each byte one by one, ignoring what the bytes mean?


Solution

  • btoa means "binary to ASCII", and a string containing multi-byte characters is not binary data as the ECMAScript spec defines strings: the spec requires browsers to implement their internal character encoding in a particular way, so the constraint is intentional (the first snippet at the end of this answer shows exactly where the cut-off lies).

    There's a great explanation here: https://mathiasbynens.be/notes/javascript-encoding. This likely can't be changed, because it would be a backwards-incompatible change: it's fully baked now, and the decision was made long ago, before Unicode was even popular and probably back when many machines were only 16-bit.

    The implementation also uses fairly involved bitwise operations, most likely for performance, and it can only do so because of that constraint (the second snippet below sketches the idea).

    Could there be a new implementation that isn't bound to the internal character encoding? Yes, but it would have to be called something else. I'm fairly sure this was discussed for the TextEncoder API but dropped, and there was some discussion about whether it could be a method on ByteArray. As far as I know, it doesn't exist yet, though you can build the equivalent yourself today (last snippet below).
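
    To see the cut-off the spec imposes, here is a quick check you can run in a browser console. U+00FF is the last Latin-1 code point and encodes fine; one code point higher and btoa throws:

    console.log(btoa("\u00ff")); // "/w=="
    console.log(btoa("\u0100")); // throws InvalidCharacterError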
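
    To make the bitwise point concrete, here is a simplified sketch of the 3-bytes-in, 4-characters-out packing at the heart of base64. This is purely illustrative, not any engine's actual code, and it only works because each input value is guaranteed to fit in 8 bits:

    const ALPHABET =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    function encodeThreeBytes(b1, b2, b3) {
      // Pack three 8-bit values into one 24-bit number...
      const n = (b1 << 16) | (b2 << 8) | b3;
      // ...then slice it into four 6-bit indices into the base64 alphabet.
      return (
        ALPHABET[(n >> 18) & 63] +
        ALPHABET[(n >> 12) & 63] +
        ALPHABET[(n >> 6) & 63] +
        ALPHABET[n & 63]
      );
    }

    console.log(encodeThreeBytes(77, 97, 110)); // "TWFu", same as btoa("Man")

    If one of those values were 0x2713 rather than a single byte, its bits would spill into the neighbouring 6-bit groups and the output would be garbage, which is exactly why the input is constrained to one byte per character.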
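
    In the meantime, the usual workaround is to do the re-encoding yourself: convert the string to UTF-8 bytes with TextEncoder, then hand btoa a string in which every character stands for one of those bytes. (base64EncodeUnicode is just an illustrative name, not a platform API.)

    function base64EncodeUnicode(str) {
      const bytes = new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
      let binary = "";
      for (const byte of bytes) {
        binary += String.fromCharCode(byte); // every char is now in the 0-255 range
      }
      return btoa(binary);
    }

    console.log(base64EncodeUnicode("✓")); // "4pyT"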