encoding, character-encoding, detection

Detect presence of a specific charset


I need a way to detect whether a file contains characters from a certain charset.

Specifically, I want to detect the presence of UTF-8-encoded Cyrillic characters in a series of files. Is there a tool to do this?

Thanks


Solution

  • If you are looking for a ready-made solution, you might want to try Enca.

    However, if you only want to detect the presence of byte sequences that could be decoded as UTF-8 Cyrillic characters (without a full UTF-8 validity check), you can simply grep for something like /(\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){n,}/. This exact regexp matches n or more consecutive UTF-8-encoded Russian Cyrillic characters: Ё (D0 81), А through п (D0 90 to D0 BF), р through я (D1 80 to D1 8F), and ё (D1 91). To additionally check that the whole file contains only valid UTF-8 data, you can use something like isutf8(1).

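    Both methods have their strengths and weaknesses, and either may sometimes give wrong results: the byte pattern can also match data in another encoding that happens to contain the same bytes, and a detector such as Enca relies on heuristics.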
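
If grep is not convenient, the same byte-level check can be sketched in Python. This is only an illustration of the regexp above, not a replacement for Enca or isutf8: the function names (`cyrillic_pattern`, `contains_russian_cyrillic`, `is_valid_utf8`) and the command-line handling are made up for this example, and the validity check simply tries to decode the whole file.

```python
import re
import sys


def cyrillic_pattern(n: int = 1) -> "re.Pattern[bytes]":
    """Compile the byte regexp from the answer for runs of at least n characters.

    D0 81        -> Ё
    D0 90..D0 BF -> А..п
    D1 80..D1 8F -> р..я
    D1 91        -> ё
    """
    # Hypothetical helper: n is substituted into the {n,} quantifier.
    return re.compile(rb"(?:\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){%d,}" % n)


def contains_russian_cyrillic(data: bytes, n: int = 1) -> bool:
    """Return True if data has n or more consecutive matching byte pairs."""
    return cyrillic_pattern(n).search(data) is not None


def is_valid_utf8(data: bytes) -> bool:
    """Rough stand-in for isutf8(1): True if the whole buffer decodes as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def main(paths: list) -> None:
    # Read each file as raw bytes so no decoding happens before the check.
    for path in paths:
        with open(path, "rb") as handle:
            data = handle.read()
        print(path,
              "cyrillic" if contains_russian_cyrillic(data, n=3) else "no-cyrillic",
              "valid-utf8" if is_valid_utf8(data) else "invalid-utf8")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved under an arbitrary name such as `find_cyrillic.py`, it could be run as `python find_cyrillic.py *.txt`. The threshold `n=3` is a guess at a reasonable default; a larger value reduces accidental matches in binary or differently encoded files, at the cost of missing files with only very short Cyrillic fragments.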