
Rough Unicode -> Language Code without CLDR?


I am writing a dictionary app. When a user types a Unicode character, I want to determine which language(s) the character might belong to.

e.g.

字 - returns ['zh', 'ja', 'ko'] 
العربية - returns ['ar']
a - returns ['en', 'fr', 'de'] //and many more
й - returns ['ru', 'be', 'bg', 'uk']

I searched and found that it could be done with CLDR (https://stackoverflow.com/a/6445024/41948)

or with the Google API ("Python - can I detect unicode string language code?").

But in my case:

  • Looking up a large charmap database would cost a lot of storage and memory
  • Calling an API is too slow, and it requires a network connection
  • It doesn't need to be very accurate; around 80% accuracy is acceptable
  • Simple and fast are the main requirements
  • It's OK to cover only UCS-2 BMP characters

Any tips?

I need to use this in Python and Javascript. Thanks!


Solution

  • Would it be sufficient to narrow the glyph down to a language family? If so, you could create a set of ranges (language -> code range) based on the layout of the BMP shown at http://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane or the Scripts section of the Unicode charts page - http://www.unicode.org/charts/

    Reliably determining the parent language of a glyph is definitely more complicated because of the number of shared symbols. If you only need 80% accuracy, you could adjust the ranges for certain languages to intentionally include or leave out certain characters, if that simplifies your ranges.

    Edit: I re-read the question you referenced CLDR from, and the first answer regarding code -> language mapping. I think that direction is definitely out of the question, but the reverse seems feasible, if a bit computationally expensive. With clever data structuring, you could identify the language family first and then drill down to the actual language ranges from there, reducing traversals through irrelevant language -> range pairs.
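
    The range-lookup idea above can be sketched in a few lines of Python. This is a minimal illustration, not a complete table: the five ranges and their candidate language lists below are assumptions chosen to match the examples in the question, and a real implementation would fill the table in from the Unicode charts. Keeping the range starts in a sorted list lets `bisect` find the candidate range in O(log n) instead of scanning every language -> range pair:

    ```python
    import bisect

    # Illustrative subset of BMP ranges -> candidate ISO 639-1 codes.
    # These ranges and language lists are assumptions for the sketch;
    # a full table would be derived from http://www.unicode.org/charts/
    RANGES = [
        (0x0041, 0x007A, ['en', 'fr', 'de']),        # Basic Latin letters
        (0x0400, 0x04FF, ['ru', 'be', 'bg', 'uk']),  # Cyrillic
        (0x0600, 0x06FF, ['ar']),                    # Arabic
        (0x3040, 0x309F, ['ja']),                    # Hiragana
        (0x4E00, 0x9FFF, ['zh', 'ja', 'ko']),        # CJK Unified Ideographs
    ]
    STARTS = [start for start, _, _ in RANGES]  # sorted keys for bisect

    def guess_languages(ch):
        """Return candidate language codes for a single BMP character."""
        cp = ord(ch)
        # Find the last range whose start is <= cp, then check its end.
        i = bisect.bisect_right(STARTS, cp) - 1
        if i >= 0:
            start, end, langs = RANGES[i]
            if cp <= end:
                return langs
        return []  # character not covered by the table

    print(guess_languages('字'))  # ['zh', 'ja', 'ko']
    print(guess_languages('й'))  # ['ru', 'be', 'bg', 'uk']
    ```

    The same structure (a sorted array of range starts plus a binary search) ports directly to JavaScript, and the whole table for the BMP stays small enough to embed in source rather than ship as a database.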