Considering `charAt()`, `charCodeAt()`, and `codePointAt()`, I find a discrepancy in what the index parameter means. Before I really thought about it, I assumed you would always be safe accessing the character at `length - 1`. But I read that the difference between `charCodeAt()` and `codePointAt()` is that `charCodeAt()` refers to 16-bit code units (byte pairs), so besides reading index `i` you would also need `i + 1` if the two units form a surrogate pair (as is the methodology with UTF-16), whereas `codePointAt()` takes a parameter that references the Unicode character position (zero-based). So now I'm in a quandary as to whether `length` counts the number of characters or the number of 16-bit units, UTF-16 style. I believe JavaScript holds strings as UTF-16, but if `codePointAt()` counts whole characters while `length` counts 16-bit units, then using `length - 1` with `codePointAt()` on a string with lots of 4-byte characters would be off the end of the string!
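For concreteness, this is the kind of case I'm worried about (the string below, containing U+1F600, is just an illustration of my assumption):

```js
const s = "a😀";                          // U+1F600 is stored as a surrogate pair in UTF-16
console.log(s.length);                    // is this 2 (characters) or 3 (16-bit units)?
console.log(s.codePointAt(s.length - 1)); // is this index even valid, or off the end?
```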
The `length` of a string is counted in 16-bit unsigned integer values ("elements") or code units (which together form a valid or invalid UTF-16 code unit sequence), and so are its indices. We might also call them "characters".
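For example, here is an illustrative snippet using U+1F600, which is encoded as a surrogate pair:

```js
const s = "a😀";
console.log(s.length);         // 3 – one code unit for "a", two for the surrogate pair
console.log(s.charCodeAt(1));  // 55357 (0xD83D) – the leading surrogate
console.log(s.codePointAt(1)); // 128512 (0x1F600) – the whole code point
console.log(s.codePointAt(2)); // 56832 (0xDE00) – a lone trailing surrogate, but still a valid index
```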
It doesn't matter whether you access them as properties or via `charAt`, `charCodeAt` and `codePointAt`: `length - 1` will always be a valid index. A code point might, however, be encoded as a surrogate pair spanning two indices. There is no built-in method to count code points, but the default string iterator yields them, so you can count them with a `for … of` loop.
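A minimal sketch of counting code points that way (the sample string is only an illustration):

```js
const s = "a😀b";
let codePoints = 0;
for (const ch of s) {    // the string iterator yields one whole code point per step
  codePoints++;
}
console.log(s.length);   // 4 – UTF-16 code units
console.log(codePoints); // 3 – code points
// Shorter equivalents: [...s].length or Array.from(s).length
```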