
Rough Unicode -> Language Code without CLDR?


I am writing a dictionary app. When a user types a Unicode character, I want to determine which language(s) the character might belong to.

e.g.

字 - returns ['zh', 'ja', 'ko'] 
العربية - returns ['ar']
a - returns ['en', 'fr', 'de'] //and many more
й - returns ['ru', 'be', 'bg', 'uk']

I searched and found that it could be done with CLDR (https://stackoverflow.com/a/6445024/41948)

or with the Google API ("Python - can I detect unicode string language code?").

But in my case:

  • Looking up a large charmap database would cost a lot of storage and memory
  • Calling an API is too slow, and it requires a network connection
  • It doesn't need to be very accurate; around 80% accuracy is acceptable
  • Simple and fast are the main requirements
  • It's OK to cover only UCS-2 BMP characters

Any tips?

I need to use this in Python and Javascript. Thanks!


Solution

  • Would it be sufficient to narrow the glyph down to a language family? If so, you could create a set of ranges (language -> code range) based on the layout of the BMP shown at http://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane or the Scripts section of the Unicode charts page - http://www.unicode.org/charts/

    Reliably determining the parent language of a glyph is definitely more complicated because of the number of shared symbols. If you only need 80% accuracy, you could adjust the ranges for certain languages to intentionally include or leave out certain characters, if that simplifies your ranges.

    Edit: I re-read the question you referenced CLDR from, and the first answer regarding code -> language mapping. I think that direction is definitely out of the question, but the reverse seems feasible, if a bit computationally expensive. With clever data structuring, you could identify the language family first and then drill down to the actual language ranges from there, reducing traversals through irrelevant language -> range pairs.
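
    The range-lookup idea above can be sketched in a few lines of Python. This is a minimal illustration, not a complete table: the five ranges and their candidate language lists below are assumptions chosen to match the examples in the question, and a real implementation would fill the table in from the Unicode charts. Keeping the range starts in a sorted list lets `bisect` find the candidate range in O(log n) instead of scanning every language -> range pair:

    ```python
    import bisect

    # Illustrative subset of BMP ranges -> candidate ISO 639-1 codes.
    # These ranges and language lists are assumptions for the sketch;
    # a full table would be derived from http://www.unicode.org/charts/
    RANGES = [
        (0x0041, 0x007A, ['en', 'fr', 'de']),        # Basic Latin letters
        (0x0400, 0x04FF, ['ru', 'be', 'bg', 'uk']),  # Cyrillic
        (0x0600, 0x06FF, ['ar']),                    # Arabic
        (0x3040, 0x309F, ['ja']),                    # Hiragana
        (0x4E00, 0x9FFF, ['zh', 'ja', 'ko']),        # CJK Unified Ideographs
    ]
    STARTS = [start for start, _, _ in RANGES]  # sorted keys for bisect

    def guess_languages(ch):
        """Return candidate language codes for a single BMP character."""
        cp = ord(ch)
        # Find the last range whose start is <= cp, then check its end.
        i = bisect.bisect_right(STARTS, cp) - 1
        if i >= 0:
            start, end, langs = RANGES[i]
            if cp <= end:
                return langs
        return []  # character not covered by the table

    print(guess_languages('字'))  # ['zh', 'ja', 'ko']
    print(guess_languages('й'))  # ['ru', 'be', 'bg', 'uk']
    ```

    The same structure (a sorted array of range starts plus a binary search) ports directly to JavaScript, and the whole table for the BMP stays small enough to embed in source rather than ship as a database.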