There's tons of info about Unicode code units, code points, etc., but I'm still a bit fuzzy about converting combining characters, graphemes, etc. using byte streams (required by libiconv).
Currently I'm only interested in converting between UTF-8, UTF-16, and UTF-32 using libiconv's iconv(), which expects the byte lengths of both the source and destination buffers as arguments.
Question: Is there a safe way to quickly calculate the maximum possible byte length of the target buffer, based on the already known byte length of the source buffer?
Let's say, for example, we're converting from u16buf to u8buf with a known u16byteslen (excluding the 0x0000 terminator, if any). In the worst-case scenario, there will be one two-byte unit per code point in the UTF-16 source buffer, corresponding to four single-byte units per code point in the UTF-8 target buffer. Is that enough to safely assume that the UTF-8 target buffer can never be longer than 2 * u16byteslen?
I've actually experimented with that and it seems to work, but I'm not sure if I'm missing corner cases involving combining characters and grapheme clusters. My doubts come from my ignorance regarding how those things are converted across these three encodings. I mean, is it possible for a grapheme to need, say, 3 UTF-16 code points but something like 10 UTF-8 code points when converted? In that case, doubling u16byteslen wouldn't suffice, right? And if so, is there any other straightforward way to pre-calculate the maximum length of the target buffer?
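For reference, this is roughly the shape of what I'm doing now (a stripped-down sketch; error handling is trimmed, and the helper name and hard-coded UTF-16LE source encoding are just for illustration):

```c
#include <iconv.h>
#include <stdlib.h>

/* Sketch: convert a UTF-16LE buffer to UTF-8, sizing the target buffer
 * as 2 * u16byteslen -- the assumption I'm asking about. */
char *u16_to_u8(const char *u16buf, size_t u16byteslen, size_t *u8byteslen)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return NULL;

    size_t dstcap = 2 * u16byteslen;            /* is this always enough? */
    char *u8buf = malloc(dstcap + 1);
    if (!u8buf) { iconv_close(cd); return NULL; }

    char *src = (char *)u16buf, *dst = u8buf;
    size_t srcleft = u16byteslen, dstleft = dstcap;
    size_t rc = iconv(cd, &src, &srcleft, &dst, &dstleft);
    iconv_close(cd);
    if (rc == (size_t)-1) { free(u8buf); return NULL; }

    *u8byteslen = dstcap - dstleft;             /* bytes actually written */
    u8buf[*u8byteslen] = '\0';
    return u8buf;
}
```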
> Question: Is there a safe way to quickly calculate the maximum possible byte length of the target buffer, based on the already known byte length of the source buffer?
Yes.
Worst-case growth factors, in bytes:

|             | to UTF-8 | to UTF-16 | to UTF-32 |
|-------------|----------|-----------|-----------|
| from UTF-8  |          | ×2        | ×4        |
| from UTF-16 | ×1½      |           | ×2        |
| from UTF-32 | ×1       | ×1        |           |
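Applied to your case, the sizing arithmetic is just this (a sketch; the helper name is mine):

```c
#include <stddef.h>

/* Worst case for UTF-16 -> UTF-8: each 2-byte UTF-16 unit expands to at
 * most 3 UTF-8 bytes (code points U+0800..U+FFFF), i.e. a factor of 1.5
 * on the byte length. Surrogate pairs (4 bytes) become 4 UTF-8 bytes,
 * which is only a factor of 1. */
size_t utf8_max_bytes_from_utf16(size_t u16byteslen)
{
    return (u16byteslen / 2) * 3;   /* u16byteslen is always even */
}
```

So the 2 * u16byteslen you're allocating now is safe; it's just a bit more generous than the ×1½ bound requires.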
You can calculate this yourself by breaking it down by code-point ranges. Pick a source and destination column, and find the largest ratio.
| Code Point   | UTF-8 bytes | UTF-16 bytes | UTF-32 bytes |
|--------------|-------------|--------------|--------------|
| 0000…007F    | 1           | 2            | 4            |
| 0080…07FF    | 2           | 2            | 4            |
| 0800…FFFF    | 3           | 2            | 4            |
| 10000…10FFFF | 4           | 4            | 4            |
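If you want to sanity-check a growth factor mechanically, the same table fits in a few lines of code (illustrative only, not part of any conversion API):

```c
#include <stdio.h>

/* Encoded lengths in bytes per code-point range, as in the table above. */
static const struct { unsigned utf8, utf16, utf32; } ranges[] = {
    {1, 2, 4},  /* U+0000  .. U+007F   */
    {2, 2, 4},  /* U+0080  .. U+07FF   */
    {3, 2, 4},  /* U+0800  .. U+FFFF   */
    {4, 4, 4},  /* U+10000 .. U+10FFFF */
};

int main(void)
{
    /* Worst-case growth for UTF-16 -> UTF-8 is the largest per-range
     * ratio of UTF-8 bytes to UTF-16 bytes. */
    double worst = 0.0;
    for (size_t i = 0; i < sizeof ranges / sizeof ranges[0]; i++) {
        double r = (double)ranges[i].utf8 / ranges[i].utf16;
        if (r > worst)
            worst = r;
    }
    printf("UTF-16 -> UTF-8 worst case: x%g\n", worst);  /* prints x1.5 */
    return 0;
}
```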
Combining characters and grapheme clusters do not affect anything. Encodings simply convert a sequence of Unicode scalar values to bytes, and they are very straightforward.
Note that you will need to add two extra bytes when converting to UTF-16, and four extra bytes when converting to UTF-32, since these encodings will add a BOM (U+FEFF) to the beginning of the text. (If you don't want that, use one of the BOM-less encodings, like UTF-16BE or UTF-16LE.)
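In sizing terms, that just means adding the BOM bytes on top of the growth factor. A couple of hypothetical helpers, assuming a BOM-emitting target converter:

```c
#include <stddef.h>

/* Worst case for UTF-8 -> UTF-16: 2 UTF-16 bytes per UTF-8 byte, plus
 * 2 bytes for the U+FEFF BOM the converter prepends. */
size_t utf16_max_bytes_from_utf8(size_t u8byteslen)
{
    return 2 * u8byteslen + 2;
}

/* Worst case for UTF-8 -> UTF-32: 4 bytes per UTF-8 byte, plus a
 * 4-byte BOM. */
size_t utf32_max_bytes_from_utf8(size_t u8byteslen)
{
    return 4 * u8byteslen + 4;
}
```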
> I mean, is it possible for a grapheme to need, say, 3 UTF-16 code points but something like 10 UTF-8 code points when converted?
No. That would imply some other kind of conversion, like a decomposition. The number of scalar values going in is equal to the number of scalar values coming out, with the possible addition of a U+FEFF byte order mark at the beginning. (I say "scalar value" instead of "code point" because "scalar value" excludes surrogates. If you are transcoding text which might have errors or might be garbage data, that doesn't change the size of the result.)