What is the encoding used by `String.fromCharCode`?


I'm using btoa to encode a Uint8Array as a Base64 string, and I hit a strange case. This works:

export function toBase64(data: Uint8Array): string {
    return btoa(String.fromCharCode(...data))
}

Whereas this does not (btoa will often complain about an unknown character):

export function toBase64(data: Uint8Array): string {
    return btoa(new TextDecoder('latin1').decode(data))
}

Question

What encoding should I use with TextDecoder to produce the same string as via fromCharCode?

Background

Piecing together various documentation, the following should be true:

  • btoa expects a Latin-1 string, i.e. one in which every character has a code in the range 0-255
  • String.fromCharCode converts each integer to the UTF-16 code unit with that value
  • for the first 256 code points, Latin-1 and UTF-16 coincide
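
These three assumptions can be checked directly. The following is a minimal sketch, assuming an environment where `btoa` is available as a global (browsers and recent Node.js versions); the helper name `acceptsBtoa` is made up for illustration.

```javascript
// Hypothetical helper: reports whether btoa accepts the given string.
function acceptsBtoa(str) {
    try {
        btoa(str);
        return true;
    } catch (err) {
        // btoa throws when the string contains a code unit above 0xFF.
        return false;
    }
}

// String.fromCharCode maps each integer to the UTF-16 code unit with that value.
const codes = Array.from({ length: 256 }, (_, i) => i);
const binary = String.fromCharCode(...codes);
const roundTrips = codes.every((code, i) => binary.charCodeAt(i) === code);

// Every code unit in 0x00-0xFF is accepted by btoa, anything above is not.
const allBytesAccepted = acceptsBtoa(binary);
const euroAccepted = acceptsBtoa("\u20AC"); // U+20AC is the euro sign

console.log(roundTrips, allBytesAccepted, euroAccepted);
```

If the assumptions hold, the first two values are true and the last one is false, which suggests the problem lies in what the decoder produces rather than in btoa itself.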

Test

Some experiments show that the two approaches yield different strings. With this setup:

const array = Array.from({ length: 256 }, (_, i) => i);
const d = new Uint8Array(array);

Running:

String.fromCharCode(...d)

will yield

\x00\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

Whereas running:

(new TextDecoder('latin1')).decode(d)

will yield

\x00\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F€\x81‚ƒ„…†‡ˆ‰Š‹Œ\x8DŽ\x8F\x90‘’“”•–—˜™š›œ\x9DžŸ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

They differ substantially in the range 80-9F (copied below, starting at 7F, for clarity):

\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F

\x7F€\x81‚ƒ„…†‡ˆ‰Š‹Œ\x8DŽ\x8F\x90‘’“”•–—˜™š›œ\x9DžŸ
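
The comparison can also be done programmatically by listing the byte values whose decoded character no longer has a code unit equal to the byte. The sketch below uses two hypothetical helpers, `changedBytes` and `hex`, that are not part of the original experiment. The pattern seen above matches windows-1252, which is the encoding that the WHATWG Encoding Standard assigns to the label "latin1", so the `encoding` property of the decoder is printed as well.

```javascript
// Hypothetical helper: given the string produced by decoding the bytes
// 0, 1, 2, ... (one character per byte), return the byte values whose
// character does not have a code unit equal to the byte.
function changedBytes(decoded) {
    const changed = [];
    for (let i = 0; i < decoded.length; i++) {
        if (decoded.charCodeAt(i) !== i) {
            changed.push(i);
        }
    }
    return changed;
}

// Hypothetical helper: format a byte value as 0xNN.
const hex = (n) => "0x" + n.toString(16).toUpperCase().padStart(2, "0");

const bytes = new Uint8Array(Array.from({ length: 256 }, (_, i) => i));

// String.fromCharCode keeps every byte value as-is, so nothing is reported.
console.log(changedBytes(String.fromCharCode(...bytes)).map(hex));

// With the "latin1" label, the reported values should all fall in 0x80-0x9F.
const decoder = new TextDecoder("latin1");
console.log(decoder.encoding);
console.log(changedBytes(decoder.decode(bytes)).map(hex));
```

Any byte reported by the second call maps to a character above 0xFF, which is exactly what btoa rejects.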

Solution

  • String.fromCharCode takes UTF-16 code units, so you have to use a UTF-16 decoder to get the same result. However, you also need a Uint16Array to represent the data:

    const array = Array.from({ length: 256 }, (_, i) => i);
    const d = new Uint16Array(array);
    const fromString = String.fromCharCode(...d);
    const decoded = (new TextDecoder("UTF-16le")).decode(d);
    console.log(fromString);
    console.log(decoded);
    console.log(fromString === decoded);

    Note that on big-endian machines you might have to use "UTF-16be" instead, or generate the buffer through a DataView, though I couldn't test it myself and I'm not sure how many such machines crawl the modern web.
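
The DataView route mentioned in the note can be sketched as follows. `DataView.prototype.setUint16` takes an explicit little-endian flag, so the buffer layout should not depend on the byte order of the host. The function names `bytesToBinaryString` and `toBase64Chunked` are made up for this example. The second one is an assumption about how the original btoa use case could sidestep TextDecoder entirely: it processes the input in chunks, because spreading a very large array into String.fromCharCode can exceed the engine's argument limit.

```javascript
// Hypothetical helper: widen each byte to one UTF-16 code unit, written
// explicitly as little-endian, then decode with the matching decoder.
function bytesToBinaryString(data) {
    const buffer = new ArrayBuffer(data.length * 2);
    const view = new DataView(buffer);
    for (let i = 0; i < data.length; i++) {
        view.setUint16(i * 2, data[i], true); // true = little-endian
    }
    return new TextDecoder("utf-16le").decode(buffer);
}

// Hypothetical alternative for the Base64 use case: no decoder involved.
// The chunk size is an arbitrary value chosen to keep each call's
// argument count small.
function toBase64Chunked(data, chunkSize = 0x8000) {
    let binary = "";
    for (let i = 0; i < data.length; i += chunkSize) {
        binary += String.fromCharCode(...data.subarray(i, i + chunkSize));
    }
    return btoa(binary);
}

const d = new Uint8Array(Array.from({ length: 256 }, (_, i) => i));
console.log(bytesToBinaryString(d) === String.fromCharCode(...d));
console.log(toBase64Chunked(d) === btoa(String.fromCharCode(...d)));
```

Both comparisons are expected to print true. For plain Base64 encoding the chunked variant is probably the simpler choice, since it never allocates the doubled buffer.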