Search code examples
cutf-8internationalizationwidestring

Using narrow string manipulation functions on wide data


I'm parsing an XML file which can contain localized strings in different languages (at the moment its just english and spanish, but in the future it could be any language), the API for the XML parser returns all data within the XML via a char* which is UTF8 encoded.

Some manipulation of the data is required after its been parsed (searching within it for substrings, concatenating strings, determining the length of substrings etc.).

It would be convenient to use standard functions such as strlen, strcat etc. As the raw data I'm receiving from the XML parser is a char* I can do all manipulation readily using these standard string handling functions.

However these all of course make the assumption and requirement that the strings are NULL terminated. My question therefore is - if you have wide data represented as a char*, can a NULL terminator character occur within the data rather than at the end?

i.e. if a character in a certain language doesn't require 2 bytes to represent it, and it is represented in one byte, will/can the other byte be NULL?


Solution

  • UTF-8 is not "wide". UTF-8 is multibyte encoding, where Unicode character can take 1 to 4 bytes. UTF-8 won't have zero terminators inside valid character. Make sure you are not confused on what your parser is giving you. It could be UTF-16 or UCS2 or their 4-byte equivalents placed in wide character strings, in which case you have to treat them as wide strings.