I have been trying to understand why encoding/decoding to UTF-8 happens all over the place in JavaScript land, and learned from the article "Let's talk about Javascript string encoding" that JavaScript uses UTF-16 encoding.
So I'm assuming that's why a library such as utf8.js exists, to convert between UTF-16 and UTF-8.
But then at the end of the article the author provides some insights:
Encoding in Node is extremely confusing, and difficult to get right. It helps, though, when you realize that Javascript string types will always be encoded as UTF-16, and most of the other places strings in RAM interact with sockets, files, or byte arrays, the string gets re-encoded as UTF-8.
This is all massively inefficient, of course. Most strings are representable as UTF-8, and using two bytes to represent their characters means you are using more memory than you need to, as well as paying an O(n) tax to re-encode the string any time you encounter a HTTP or filesystem boundary.
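If I am reading that right, the re-encoding happens whenever a string crosses into a byte-oriented API. Here is a quick Node sketch of what I mean (my own illustration, not from the article):

```js
// A JavaScript string lives in memory as UTF-16 code units.
const s = "héllo";
console.log(s.length); // 5 code units

// Crossing a byte boundary (Buffer, socket, file) re-encodes it as UTF-8.
const bytes = Buffer.from(s, "utf8"); // 'utf8' is also Node's default
console.log(bytes.length); // 6 bytes, since 'é' needs two bytes in UTF-8
```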
That reminded me of the <meta charset="utf-8"> tag in the HTML <head>, which I never really thought much about beyond "you need this to get text working properly".

Now I'm wondering, and this is what this question is about, whether that <meta charset="utf-8"> tag tells JavaScript to use UTF-8 encoding. That would mean that strings created in JavaScript are UTF-8 encoded rather than UTF-16. Or, if I'm wrong there, what exactly does it do? If it does tell JavaScript to use UTF-8 instead of UTF-16 (which I guess would be considered the "default"), then you would no longer pay that O(n) tax for conversions between UTF-8 and UTF-16, which would be a performance improvement. Am I understanding this correctly, and if not, what am I missing?
Charset in meta
The <meta charset="utf-8"> tag tells HTML (less sloppily: the HTML parser) that the encoding of the page is UTF-8.

JavaScript does not have a built-in facility to switch between different encodings of strings; a string is always UTF-16.
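Both points are easy to observe in a browser console; a small sketch (the exact output depends on the page's declared charset):

```js
// The meta tag only affects how the page's bytes were decoded:
console.log(document.characterSet); // "UTF-8" on a <meta charset="utf-8"> page

// JS strings are UTF-16 code units regardless:
const clef = "𝄞"; // U+1D11E, outside the Basic Multilingual Plane
console.log(clef.length);                     // 2 - a surrogate pair
console.log(clef.charCodeAt(0).toString(16)); // "d834" - high surrogate
console.log(clef.charCodeAt(1).toString(16)); // "dd1e" - low surrogate
```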
Asymptotic bounds
I do not think that there is an O(n) penalty for encoding conversions. Whenever this kind of encoding change is due, there already is an O(n) operation: reading/writing the data stream. So any fixed number of operations on each octet would still be O(n). An encoding change requires local knowledge only, i.e. a look-ahead window of fixed length, and can thus be incorporated into the stream read/write code with an O(1) penalty per octet.
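To sketch how the conversion folds into the streaming pass, here is a minimal Node example (the chunking is artificial, just to show that the decoder only needs a fixed-size look-ahead across chunk boundaries):

```js
// TextDecoder is a global in modern Node.
const decoder = new TextDecoder("utf-16le");

// UTF-16LE input split at an odd offset, i.e. in the middle of a code unit.
const input = Buffer.from("héllo wörld", "utf16le");
const chunks = [input.subarray(0, 5), input.subarray(5)];

const out = [];
for (const chunk of chunks) {
  // { stream: true } buffers the incomplete trailing code unit: O(1) state.
  out.push(Buffer.from(decoder.decode(chunk, { stream: true }), "utf8"));
}
out.push(Buffer.from(decoder.decode(), "utf8")); // flush the tail
console.log(Buffer.concat(out).toString("utf8")); // "héllo wörld"
```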
You could argue that the space penalty is O(n), though if there is a need to store the string in some standard encoding (i.e. without compression), the move to UTF-16 means a factor of 2 at most, thus staying within the O(n) bound.
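The factor of 2 is easy to check in Node for typical Western text:

```js
const s = "hello, world";
console.log(Buffer.byteLength(s, "utf8"));    // 12 bytes
console.log(Buffer.byteLength(s, "utf16le")); // 24 bytes - exactly 2x for ASCII
// Non-ASCII narrows the gap: "héllo" is 6 bytes in UTF-8, 10 in UTF-16LE.
```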
Constant factors
Even if the concern is minimizing the constant factors hidden in the O(n) notation, encoding changes have a modest impact, at least in the time domain. For mostly Western textual data, writing/reading a UTF-16 stream as UTF-8 amounts to skipping every second octet / inserting null octets. That performance hit pales in comparison with the overhead and latency of interfacing with a socket or the file system.
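The "every second octet is null" pattern is visible directly for ASCII data:

```js
const buf = Buffer.from("abc", "utf16le");
console.log(buf); // <Buffer 61 00 62 00 63 00> - the 00s are the octets skipped/inserted
```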
Storage is different, of course, though storage is comparatively cheap today and the upper bound of 2 still holds. The move from 32 to 64 bits has a higher memory impact with respect to number representations and pointers.