I am working with Python, and I would like to find the roots of some words that mainly refer to countries. Some examples that demonstrate what I need are:
I have experimented a bit with the Porter, Lancaster and Snowball stemmers of the NLTK module. Porter and Snowball do not change the tokens at all, while Lancaster is too aggressive: for example, the Lancaster stem of American is "Am", which is pretty badly butchered. I have also tried the WordNet lemmatizer, with no success.
Is there a way to get the above results, even if it only works for countries?
You might want to check out Unicode's CLDR (Common Locale Data Repository): http://cldr.unicode.org/
It has lists of territories and languages that might be useful, as you could map them together using their standard codes: languages are keyed by ISO 639 codes (en, de, fr, etc.) and territories by ISO 3166 region codes (US, DE, FR, etc.).
Here's a useful JSON repository:
https://github.com/unicode-cldr/cldr-localenames-full/tree/master/main/en
Check out the territories.json and languages.json files there.
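As a rough sketch of how the two files could be combined in Python: the snippet below assumes each file nests its table under main → en → localeDisplayNames → territories (or languages), which is how I understand the CLDR JSON layout, so verify it against the actual files. The function names are my own, not part of any library. It pairs a language name with a territory name whenever the upper-cased language code happens to be a territory code as well, which is only a heuristic.

```python
import json


def load_display_names(path, section, locale="en"):
    """Return the code -> display-name table from one CLDR JSON file.

    Assumes the layout main -> <locale> -> localeDisplayNames -> <section>;
    check this against the file you actually downloaded.
    """
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    return data["main"][locale]["localeDisplayNames"][section]


def language_to_territory(languages, territories):
    """Map language names to territory names whose codes coincide.

    Heuristic only: "de" (German) lines up with "DE" (Germany), but
    "ar" (Arabic) also lines up with "AR" (Argentina), and "en" has no
    matching territory at all.
    """
    mapping = {}
    for code, language_name in languages.items():
        region = code.upper()
        if region in territories:
            mapping[language_name] = territories[region]
    return mapping


# Tiny in-memory stand-ins for the two tables, for illustration only.
sample_languages = {"de": "German", "fr": "French", "en": "English", "ar": "Arabic"}
sample_territories = {"DE": "Germany", "FR": "France", "US": "United States", "AR": "Argentina"}

sample_mapping = language_to_territory(sample_languages, sample_territories)
```

With the sample tables, German and French map to Germany and France, English is dropped, and Arabic wrongly lands on Argentina, so the result needs manual review or a curated exception list. With the real files you would call load_display_names("languages.json", "languages") and load_display_names("territories.json", "territories") first. Note that this will not cover words like "American", which is not a language name.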