Detect presence of a specific charset

I need a way to detect whether a file contains characters from a certain charset.

Specifically, I want to detect the presence of UTF-8-encoded Cyrillic characters in a series of files. Is there a tool to do this?

Thanks

Solution

If you are looking for a ready-made solution, you might want to try Enca.

However, if you only want to detect the presence of byte sequences that could be decoded as UTF-8 Cyrillic characters (without a full UTF-8 validity check), you can simply grep for something like /(\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){n,}/. This exact regexp matches n or more consecutive UTF-8-encoded Russian Cyrillic letters (А-я, plus Ё and ё). To additionally check that the whole file contains only valid UTF-8 data, you can use something like isutf8(1).
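As a rough illustration of the byte-pattern approach, here is a minimal Python sketch. The function names (has_cyrillic_run, is_valid_utf8) and the default n=3 are my own choices for the example, not part of any tool mentioned above, and the validity check is only a stand-in for what isutf8(1) does, using Python's built-in decoder.

```python
import re
import sys

# Byte-level pattern from the answer: one Russian Cyrillic letter
# (А-я, Ё, ё) as it appears when encoded in UTF-8.
CYRILLIC_UTF8 = rb"(?:\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91])"


def has_cyrillic_run(data: bytes, n: int = 3) -> bool:
    """Return True if data contains at least n consecutive matching letters.

    This only looks at raw bytes; it does not prove the file is UTF-8.
    """
    pattern = re.compile(CYRILLIC_UTF8 + b"{%d,}" % n)
    return pattern.search(data) is not None


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Usage (hypothetical script name): python detect_cyrillic.py file1 file2 ...
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            data = f.read()
        print(path, has_cyrillic_run(data), is_valid_utf8(data))
```

A larger n reduces accidental matches in binary data or in text stored in other encodings, at the cost of missing files that contain only very short Cyrillic words.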

Both methods have their strengths and weaknesses, and either may sometimes give wrong results. For example, the byte pattern can also match data that is not actually UTF-8 text.
