Tags: c++, unicode, character-encoding, icu

ICU: How to filter the charset detection to the available converters?


I'm working on character set detection using ICU, via another library that bundles it, but that ICU build does not include converters for every character set it can detect. For example, there is a converter for ISO-8859-1, but not for ISO-8859-2.

I've tried a couple of things, such as using ucnv_getAvailableName, but it returns converter names, which don't seem to work with ucsdet_setDetectableCharset (unless I made a mistake).

Thus, my question: how can I restrict charset detection to the charsets that have an available converter?

I was also wondering if there was a way to bias the detection towards UTF-8 (apart from looking through all charset detection results), e.g. for files detected as ISO-8859-1 even though all characters in the file can be encoded in UTF-8.


Solution

  • (unless I made a mistake)

    I made a mistake.

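    ucsdet_setDetectableCharset sets the status to a failure code for charsets that it cannot detect, which is logical. I did not reset the status afterwards, expecting later calls to set the correct status themselves (i.e. success in case of success). That is not how ICU works: a function that takes a UErrorCode does not reset it to success, and it returns immediately if the status already holds a failure. I had forgotten about that.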

    With the status reset, I get some overlap between the detectable and the convertible charsets.