javascript · html · encoding · utf-8 · character-encoding

Does <meta charset="utf-8"> mean that JavaScript uses UTF-8 encoding instead of UTF-16?


I have been trying to understand why the need to encode/decode to UTF-8 comes up all over the place in JavaScript land, and learned that JavaScript uses UTF-16 encoding:

Let’s talk about Javascript string encoding

So I'm assuming that's why a library such as utf8.js exists, to convert between UTF-16 and UTF-8.
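For what it's worth, modern browsers and Node also ship TextEncoder/TextDecoder built in, which do this same kind of conversion; a minimal sketch of the round trip:

```javascript
// Convert a JS (UTF-16) string to UTF-8 bytes and back.
const utf8Bytes = new TextEncoder().encode('héllo');
console.log(utf8Bytes);        // Uint8Array(6) [104, 195, 169, 108, 108, 111]

const roundTripped = new TextDecoder('utf-8').decode(utf8Bytes);
console.log(roundTripped);     // 'héllo'
```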

But then at the end of that post the author provides some insight:

Encoding in Node is extremely confusing, and difficult to get right. It helps, though, when you realize that Javascript string types will always be encoded as UTF-16, and most of the other places strings in RAM interact with sockets, files, or byte arrays, the string gets re-encoded as UTF-8.

This is all massively inefficient, of course. Most strings are representable as UTF-8, and using two bytes to represent their characters means you are using more memory than you need to, as well as paying an O(n) tax to re-encode the string any time you encounter a HTTP or filesystem boundary.

That reminded me of the <meta charset="utf-8"> in the HTML <head>, which I never really thought too much about, other than "you need this to get text working properly".

Now I'm wondering (and this is what my question is about) whether that <meta charset="utf-8"> tag tells JavaScript to use UTF-8 encoding. That would mean that strings you create in JavaScript are UTF-8 encoded rather than UTF-16; or, if I'm wrong there, what exactly is it doing? If it does tell JavaScript to use UTF-8 instead of UTF-16 (which I guess would be considered the "default"), then you wouldn't need to pay that O(n) tax for converting between UTF-8 and UTF-16, which would mean a performance improvement. Am I understanding this correctly, and if not, what am I missing?


Solution

  • Charset in meta

    The <meta charset="utf-8"> tag tells HTML (less sloppily: the HTML parser) that the encoding of the page is UTF-8.
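    The declared encoding is even visible to scripts, though reading it makes the point that it is parser metadata, not a string-storage setting; a quick sketch (browser only):

    ```javascript
    // document.characterSet reflects <meta charset="utf-8">;
    // it says nothing about how JS stores strings internally.
    console.log(document.characterSet); // 'UTF-8'
    ```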

    JS does not have a built-in facility to switch between different encodings of strings; it is always UTF-16.
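    You can observe the UTF-16 representation directly from any script, whatever the page's charset; a quick sketch:

    ```javascript
    // String indexing works in UTF-16 code units, not Unicode code points.
    const s = '𝌆';                              // U+1D306, outside the BMP
    console.log(s.length);                      // 2 (a surrogate pair)
    console.log(s.charCodeAt(0).toString(16));  // 'd834' (high surrogate)
    console.log(s.charCodeAt(1).toString(16));  // 'df06' (low surrogate)
    console.log([...s].length);                 // 1 (iteration is code-point aware)
    ```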

    Asymptotic bounds

    I do not think there is an O(n) penalty for encoding conversions. Whenever this kind of encoding change is due, there already is an O(n) operation: reading/writing the data stream. So any fixed number of operations per octet still stays within O(n). An encoding change requires local knowledge only, i.e. a look-ahead window of fixed length, and can thus be incorporated into the stream read/write code with a penalty of O(1).
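    The built-in TextDecoder works exactly this way: its stream option keeps the fixed-length look-ahead (an incomplete multi-byte sequence) across chunks, so the conversion rides along with the read loop. A sketch:

    ```javascript
    // Decode per chunk while reading; no separate O(n) conversion pass.
    const decoder = new TextDecoder('utf-8');
    const chunks = [
      new Uint8Array([0xe2, 0x82]),  // first two bytes of '€' (U+20AC) ...
      new Uint8Array([0xac, 0x21]),  // ... its last byte, then '!'
    ];

    let text = '';
    for (const chunk of chunks) {
      // stream: true buffers the partial sequence until it completes
      text += decoder.decode(chunk, { stream: true });
    }
    text += decoder.decode();        // flush
    console.log(text);               // '€!'
    ```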

    You could argue that the space penalty is O(n), though if there is a need to store the string in some standard encoding (i.e. without compression), the move to UTF-16 means a factor of 2 at most, thus staying within the O(n) bound.
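    That factor is easy to measure; a sketch using Node's Buffer (in a browser, new TextEncoder().encode(s).length gives the UTF-8 size):

    ```javascript
    // Compare UTF-8 vs UTF-16 storage for a mostly-ASCII string (Node.js).
    const s = 'hello, wörld';
    console.log(Buffer.byteLength(s, 'utf8'));     // 13 ('ö' takes 2 bytes)
    console.log(Buffer.byteLength(s, 'utf16le'));  // 24 (2 bytes per code unit)
    ```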

    Constant factors

    Even if the concern is minimizing the constant factors hidden in the O(n) notation, an encoding change has a modest impact, in the time domain at least. Writing/reading a UTF-16 stream as UTF-8 for mostly ASCII (Western) textual data means little more than skipping every second octet / inserting null octets. That performance hit pales in comparison with the overhead and the latency stemming from interfacing with a socket or the file system.
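    For ASCII-range text you can see that directly; a Node sketch:

    ```javascript
    // For ASCII text, UTF-16LE is just the UTF-8 bytes
    // with a 0x00 inserted after each one.
    console.log(Buffer.from('abc', 'utf8'));     // <Buffer 61 62 63>
    console.log(Buffer.from('abc', 'utf16le'));  // <Buffer 61 00 62 00 63 00>
    ```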

    Storage is different, of course, though storage is comparatively cheap today and the upper bound of 2 still holds. The move from 32-bit to 64-bit has a higher memory impact with respect to number representations and pointers.