Tags: javascript, arrays, utf-8, utf-16

What is a safe length of JavaScript strings?


Considering charAt(), charCodeAt(), and codePointAt(), I find a discrepancy in what the parameter means. Before I really thought about it, I assumed you would always be safe accessing the character at length - 1. But I read that the difference between charCodeAt() and codePointAt() is that charCodeAt() works on 16-bit code units, so besides reading index i you would also need i + 1 if the two form a surrogate pair (as UTF-16 does), whereas codePointAt() takes a parameter that references the character (code point) position, zero-based. So now I'm in a quandary as to whether length counts the number of characters or the number of 16-bit code units. I believe JavaScript holds strings as UTF-16, but then using length - 1 with codePointAt() on a string that had lots of 4-byte characters would be off the end of the string!
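
To make the worry concrete, here's a quick illustration of my own (using 😀, U+1F600, as an example 4-byte character):

    // "😀" (U+1F600) is outside the Basic Multilingual Plane,
    // so UTF-16 stores it as a surrogate pair
    const s = "😀";
    console.log(s.length);          // 2 – but is that characters or something else?
    console.log(s.charCodeAt(0));   // 55357 – high surrogate
    console.log(s.charCodeAt(1));   // 56832 – low surrogate
    console.log(s.codePointAt(0));  // 128512 – the full code point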


Solution

  • A string's length is counted in 16-bit unsigned integer values (“elements”), i.e. UTF-16 code units (which together form a valid or invalid UTF-16 code unit sequence), and so are its indices. Loosely, we might also call them "characters".

    It doesn't matter whether you access them as properties or via charAt, charCodeAt or codePointAt: length - 1 will always be a valid index. A code point might, however, be encoded as a surrogate pair spanning two indices. There is no built-in property that gives the number of code points, but the default string iterator yields them one by one, so you can count them with a for … of loop, as sketched below.
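
    A minimal sketch (the example string is arbitrary, chosen only to contain one astral-plane character):

        // "a😀b" – the emoji is stored as a surrogate pair of two code units
        const str = "a\u{1F600}b";

        console.log(str.length);                     // 4 – "a", high surrogate, low surrogate, "b"
        console.log(str.charCodeAt(str.length - 1)); // 98 – length - 1 is always a valid index
        console.log(str.codePointAt(1));             // 128512 – whole code point, read from the pair
        console.log(str.codePointAt(2));             // 56832 – a lone low surrogate when you land mid-pair

        // Counting code points with the default string iterator:
        let codePoints = 0;
        for (const ch of str) codePoints++;
        console.log(codePoints);                     // 3

        // Equivalent: spread the iterator into an array
        console.log([...str].length);                // 3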