I'm working on character set detection using ICU, via another library that bundles it, but that build does not have converters for every character set it can detect. For example, there is a converter for ISO-8859-1, but not for ISO-8859-2.
I've tried a couple of things, such as using ucnv_getAvailableName, but it returns names of converters, which don't seem to work with ucsdet_setDetectableCharset (unless I made a mistake).
Thus, my question: how can I restrict charset detection to the charsets that have an available converter?
I was also wondering whether there is a way to bias the detection towards UTF-8 (apart from looking through all charset detection results), e.g. for files detected as ISO-8859-1 even though all characters in the file can be encoded in UTF-8.
(unless I made a mistake)
I made a mistake.
ucsdet_setDetectableCharset sets the status to failure for charsets that it cannot detect (logical). I did not reset the failure status, expecting each call to set the correct status (i.e. success in case of success); however, this is not how ICU works, and I forgot about that: an ICU function that receives a status already indicating failure returns immediately and leaves that status untouched.
Resetting the status before each call gives me some overlap between the detectable and the convertible charsets.