I'm using btoa to encode a Uint8Array to a base64 string, and I hit a strange case. This works:
export function toBase64(data: Uint8Array): string {
  return btoa(String.fromCharCode(...data))
}
Whereas this does not (btoa will often complain about an unknown character):
export function toBase64(data: Uint8Array): string {
  return btoa(new TextDecoder('latin1').decode(data))
}
What encoding should I use with TextDecoder to produce the same string as via fromCharCode?
Piecing together various documentation, the following should be true:

- btoa expects a latin1 encoding
- String.fromCharCode will convert individual integers to the respective UTF-16 character
- latin1 and UTF-16 overlap

Doing some experiments, it is clear the two approaches yield different strings. With this setup:
const array = Array.from({ length: 256 }, (_, i) => i);
const d = new Uint8Array(array);
Running:
String.fromCharCode(...d)
will yield
\x00\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
Whereas running:
(new TextDecoder('latin1')).decode(d)
will yield
\x00\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F€\x81‚ƒ„…†‡ˆ‰Š‹Œ\x8DŽ\x8F\x90‘’“”•–—˜™š›œ\x9DžŸ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
They substantially differ in the range 0x80-0x9F (copied below, starting from 0x7F, for clarity):
\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F
\x7F€\x81‚ƒ„…†‡ˆ‰Š‹Œ\x8DŽ\x8F\x90‘’“”•–—˜™š›œ\x9DžŸ
String.fromCharCode takes UTF-16 code units, so you'd have to use a UTF-16 decoder to get the same result. (The mismatch you saw comes from the Encoding Standard treating "latin1" as a mere label for windows-1252, which maps the bytes 0x80-0x9F to printable characters rather than C1 control characters; a quick check of this follows the snippet below.) However, you also need to use a Uint16Array to represent the data:
const array = Array.from({ length: 256 }, (_, i) => i);
const d = new Uint16Array(array); // one 16-bit code unit per value
const fromString = String.fromCharCode(...d);
// TextDecoder reads the typed array's underlying bytes; on a
// little-endian machine a Uint16Array's bytes are laid out as UTF-16LE.
const decoded = (new TextDecoder("UTF-16le")).decode(d);
console.log(fromString);
console.log(decoded);
console.log(fromString === decoded); // true
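As a side check on the latin1 point above, the decoder you get from the "latin1" label really is windows-1252; a minimal sketch:

// "latin1" is only a label: the Encoding Standard resolves it to
// windows-1252, which is why bytes 0x80-0x9F came back as €, ‚, ƒ, ...
const bytes = new Uint8Array([0x80, 0x91, 0x9d]);
console.log(new TextDecoder("latin1").encoding); // "windows-1252"
console.log(
  new TextDecoder("latin1").decode(bytes) ===
  new TextDecoder("windows-1252").decode(bytes)
); // true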
Note that on big-endian machines you might have to use "UTF-16be" instead, or generate the buffer through a DataView, though I couldn't test it myself and I'm not sure how many such machines crawl the modern web.
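For reference, a minimal sketch of that DataView variant, untested on real big-endian hardware: writing each code unit with an explicit little-endian flag makes the byte layout independent of the host, so "utf-16le" decodes it the same way everywhere.

const array = Array.from({ length: 256 }, (_, i) => i);
const buffer = new ArrayBuffer(array.length * 2);
const view = new DataView(buffer);
// setUint16's third argument forces little-endian byte order
// regardless of the host machine's native endianness.
array.forEach((value, i) => view.setUint16(i * 2, value, true));
const decoded = new TextDecoder("utf-16le").decode(buffer);
console.log(decoded === String.fromCharCode(...array)); // true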
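To tie this back to the original question, here is a sketch of how toBase64 could be rewritten on top of this, assuming a little-endian host as above:

export function toBase64(data: Uint8Array): string {
  // new Uint16Array(data) zero-extends each byte to a 16-bit code unit,
  // so the decoder reproduces String.fromCharCode(...data); every
  // resulting character stays below 0x100, which keeps btoa happy.
  return btoa(new TextDecoder("utf-16le").decode(new Uint16Array(data)));
}

Whether this is preferable to the spread-based version is mostly a question of input size: spreading a very large Uint8Array into String.fromCharCode can exceed the engine's argument-count limit, while the decoder path has no such limit.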