I am writing a dictionary app. If a user types an Unicode character I want to check which language the character is.
e.g.
字 - returns ['zh', 'ja', 'ko']
العربية - returns ['ar']
a - returns ['en', 'fr', 'de'] //and many more
й - returns ['ru', 'be', 'bg', 'uk']
I searched and found that it could be done with CLDR https://stackoverflow.com/a/6445024/41948
Or Google API Python - can I detect unicode string language code?
But in my case
Any tips?
I need to use this in Python and Javascript. Thanks!
Would it be sufficient to narrow the glyph down to language families? If so, you could create a set of ranges (language -> code range) based on the mapping of BMP like the one shown at http://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane or the Scripts section of the Unicode charts page - http://www.unicode.org/charts/
Reliably determining parent language for glyphs is definitely more complicated because of the number of shared symbols. If you only need 80% accuracy, you could potentially adjust your ranges for certain languages to intentionally include/leave out certain characters if it simplifies your ranges.
Edit: I re-read through the question you referenced CLDR from and the first answer regarding code -> language mapping. I think that's definitely out of the question but the reverse seems feasible if a bit computationally expensive. With clever data structuring, you could identify language families and then drill down to the actual language ranges from there, reducing traversals through irrelevant language -> range pairs.