
Smart stemming/lemmatizing in Python for Nationalities


I am working with Python, and I would like to find the roots of some words, that mainly refer to countries. Some examples that demonstrate what I need are:

  • Spanish should give me Spain.
  • English should give me England.
  • American should give me America.
  • Nigerian should give me Nigeria.
  • Greeks (plural) should give me Greece.
  • Puerto Ricans (plural) should give me Puerto Rico.
  • Portuguese should give me Portugal.

I have experimented a bit with the Porter, Lancaster and Snowball stemmers from the NLTK module, but Porter and Snowball do not change the tokens at all, while Lancaster is too aggressive. For example, the Lancaster stem of American is "am", which is pretty badly butchered. I have also played with the WordNet lemmatizer, with no success.
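The behaviour described above can be reproduced with a quick side-by-side comparison (a small sketch, assuming NLTK is installed; no corpus downloads are needed for the stemmers):

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# Nationality words from the question; compare each stemmer's output.
for word in ["spanish", "american", "nigerian", "greeks"]:
    print(f"{word}: porter={porter.stem(word)} "
          f"snowball={snowball.stem(word)} lancaster={lancaster.stem(word)}")
```

Porter and Snowball leave most of these words untouched, while Lancaster strips them down well past the country root, which is exactly the problem: none of the three knows that the target form is a country name rather than a morphological stem.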

Is there a way to get the above results, even if it only works for countries?


Solution

  • You might want to check out Unicode's CLDR (Common Locale Data Repository): http://cldr.unicode.org/

    It has lists of territories and languages that you can map together using their standard codes: ISO 639 for languages and ISO 3166 for territories. The two often coincide (es/ES, pt/PT, de/DE), though not always (en/GB, el/GR).

    Here's a useful JSON repository:

    https://github.com/unicode-cldr/cldr-localenames-full/tree/master/main/en

    Check out the territories.json and languages.json files there.
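A minimal sketch of the mapping idea, using small hard-coded excerpts that mimic the JSON structure of those two files (in practice you would load the full territories.json and languages.json downloaded from the repository above; the override table for codes that don't coincide is an assumption you would extend as needed):

```python
import json

# Excerpts shaped like CLDR's languages.json and territories.json
# (cldr-localenames-full/main/en). Replace with the full files in practice.
languages_json = """
{"main": {"en": {"localeDisplayNames": {"languages": {
    "es": "Spanish", "el": "Greek", "pt": "Portuguese", "en": "English"}}}}}
"""
territories_json = """
{"main": {"en": {"localeDisplayNames": {"territories": {
    "ES": "Spain", "GR": "Greece", "PT": "Portugal", "GB": "United Kingdom"}}}}}
"""

languages = json.loads(languages_json)["main"]["en"]["localeDisplayNames"]["languages"]
territories = json.loads(territories_json)["main"]["en"]["localeDisplayNames"]["territories"]

# Hypothetical override table for languages whose ISO 639 code does not
# match the territory code when uppercased (en -> GB, el -> GR, etc.).
overrides = {"en": "GB", "el": "GR"}

def language_to_country(language_name):
    """Map a language display name (e.g. 'Spanish') to a country name."""
    for code, name in languages.items():
        if name == language_name:
            territory_code = overrides.get(code, code.upper())
            return territories.get(territory_code)
    return None

print(language_to_country("Spanish"))     # Spain
print(language_to_country("Portuguese"))  # Portugal
print(language_to_country("Greek"))       # Greece
```

Note this only covers the language-name cases (Spanish, Portuguese, Greek); plural demonyms like "Greeks" or "Puerto Ricans" would still need singularizing first, and purely demonymic forms like "Nigerian" are not in languages.json at all, so a separate demonym-to-territory table would be required for those.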