Question: Is there a simple way to discover whether a given UTF file does or does not contain Asian characters? It would be great if that worked with both UTF-8 and UTF-16, and better yet if it could be done in Ruby rather than with a generic algorithm.
EDIT: From the comments I learned about CJK, which is most likely what I'm looking for.
So, is there a way to test whether a UTF file contains CJK characters?
This may be reinventing the wheel, but you can use unpack('U*') to get the Unicode codepoints from any string, e.g.
codepoints = '㌂'.unpack('U*')
=> [13058]
Then you can use .any?:
codepoints.any? { |c| overlaps_cjk?(c) }
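Note that unpack('U*') decodes the string's bytes as UTF-8, so a UTF-16 file would need to be transcoded first. A minimal sketch, assuming a little-endian UTF-16 file at the hypothetical path some_file.txt (BOM handling omitted):
# Read raw bytes, tag them as UTF-16LE, then transcode to UTF-8.
raw  = File.binread('some_file.txt')
text = raw.force_encoding('UTF-16LE').encode('UTF-8')
codepoints = text.unpack('U*')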
You can derive the overlaps_cjk? method by collecting all of the codepoint blocks you consider "Asian characters" from http://graphemica.com/blocks, for instance:
CJK_CODEPOINTS = [(13000..13500)]

def overlaps_cjk?(codepoint)
  CJK_CODEPOINTS.any? { |range| range.cover?(codepoint) }
end
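For completeness, a minimal sketch wiring this up against a whole UTF-8 file; the path example.txt and the helper name file_has_cjk? are just placeholders, and the single range above is only illustrative, so in practice you would list each block you care about (e.g. CJK Unified Ideographs at 0x4E00..0x9FFF):
# Hypothetical wrapper: true if any codepoint in the file falls in a CJK range.
def file_has_cjk?(path)
  text = File.read(path, encoding: 'UTF-8')
  text.unpack('U*').any? { |c| overlaps_cjk?(c) }
end

file_has_cjk?('example.txt')
# => true or false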