ruby encoding utf-8 utf-16 file-io

How to tell if a UTF-8 file has Asian characters?


Question: Is there a simple way to find out whether a given UTF file contains Asian characters? It would be great if it worked with both UTF-8 and UTF-16, and better still if it were done in Ruby rather than as a generic algorithm.

EDIT: From the comments I learned about CJK, which is most likely what I'm looking for.

So, is there a way to test whether a UTF file has CJK characters?


Solution

  • This may be reinventing the wheel, but you can use unpack('U*') to get the Unicode codepoints of any UTF-8 string, e.g.

       codepoints = '㌂'.unpack('U*')
        => [13058] 
    

    Then you can use .any? to check whether any of those codepoints falls in a range you treat as CJK:

     codepoints.any?{|c| overlaps_cjk?(c)}
    

    You can derive the overlaps_cjk? method by collecting all the codepoint blocks you consider "Asian characters" from http://graphemica.com/blocks

    For instance:

     # Illustrative range only; replace with the blocks you want to treat as CJK
     CJK_CODEPOINTS = [(13000..13500)]
     def overlaps_cjk?(codepoint)
       CJK_CODEPOINTS.any?{|range| range.cover?(codepoint)}
     end
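
As an alternative to hand-maintained ranges, here is a minimal sketch that applies the same idea to a whole file using Ruby's Unicode script properties in a regular expression. The names CJK_PATTERN, cjk_text? and cjk_file? are hypothetical, and the choice of scripts (Han, Hiragana, Katakana, Hangul) is an assumption about what should count as CJK. The sketch also assumes you know the file's encoding and that the file is valid in that encoding. For UTF-16 the file is transcoded to UTF-8 before matching.

```ruby
# Assumption: "CJK" here means the Han, Hiragana, Katakana and Hangul scripts.
# Adjust the character class if you need a different definition.
CJK_PATTERN = /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/

# True if the (UTF-8) string contains at least one character from those scripts.
def cjk_text?(text)
  text.match?(CJK_PATTERN)
end

# Reads the file as raw bytes, tags the bytes with the encoding the caller
# states (e.g. 'UTF-8' or 'UTF-16LE'), transcodes to UTF-8, then tests it.
def cjk_file?(path, encoding = 'UTF-8')
  raw = File.binread(path)
  text = raw.force_encoding(encoding).encode('UTF-8')
  cjk_text?(text)
end
```

Usage would look like cjk_file?('notes.txt') for a UTF-8 file or cjk_file?('notes.txt', 'UTF-16LE') for a little-endian UTF-16 file, where notes.txt is a placeholder path. Script properties also match characters outside the Basic Multilingual Plane, which a short list of codepoint ranges can easily miss.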