python nlp nltk wordnet lemmatization

Getting the root word using the Wordnet Lemmatizer


I need to map all related words to a common root word for a keyword extractor.

How can I reduce words to the same root using the Python NLTK lemmatizer?

  • E.g.:
    1. generalized, generalization -> general
    2. optimal, optimized -> optimize (maybe)
    3. configure, configuration, configured -> configure

The Python NLTK lemmatizer gives 'generalize' for 'generalized' and 'generalizing' when the part-of-speech (POS) tag parameter is used, but not for 'generalization'.

Is there a way to do this?


Solution

  • Use SnowballStemmer:

    >>> from nltk.stem.snowball import SnowballStemmer
    >>> stemmer = SnowballStemmer("english")
    >>> print(stemmer.stem("generalized"))
    general
    >>> print(stemmer.stem("generalization"))
    general
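For the keyword-extractor use case in the question, the stems can serve as grouping keys. The following is only a minimal sketch: `group_by_stem` is a hypothetical helper name, not part of NLTK, and the word list is taken from the question's examples. Note that a stem is not guaranteed to be a dictionary word.

```python
from collections import defaultdict

from nltk.stem.snowball import SnowballStemmer


def group_by_stem(words, language="english"):
    """Group words that share the same Snowball stem (hypothetical helper)."""
    stemmer = SnowballStemmer(language)
    groups = defaultdict(list)
    for word in words:
        # The stem is used as the grouping key; original words are kept as values.
        groups[stemmer.stem(word.lower())].append(word)
    return dict(groups)


keywords = ["generalized", "generalization",
            "configure", "configuration", "configured"]
groups = group_by_stem(keywords)
for stem, members in groups.items():
    print(stem, "->", members)
```

Keeping the original words as values lets the extractor report a readable form (for example, the shortest member of each group) instead of the raw stem.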
    

    Note: Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.

    A general issue I have seen with lemmatizers is that they treat even longer, derived words as lemmas.

    Example, using the WordNet lemmatizer (checked in NLTK):

    • Generalized => Generalize
    • Generalization => Generalization
    • Generalizations => Generalization

    No POS tag was given as input in the above cases, so each word was treated as a noun.
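To avoid the noun default, a POS code can be passed to `WordNetLemmatizer.lemmatize` through its `pos` argument. The sketch below assumes the words have already been tagged with Penn Treebank tags (for example by `nltk.pos_tag`) and that the `wordnet` corpus is installed. `penn_to_wordnet_pos` and `lemmatize_tagged` are hypothetical helper names introduced here for illustration.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS code (hypothetical helper)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun


def lemmatize_tagged(tagged_words):
    """Lemmatize (word, penn_tag) pairs using the mapped POS code."""
    # Imported here so the tag mapping above is usable without NLTK data;
    # the lemmatizer itself needs the 'wordnet' corpus to be downloaded.
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [
        lemmatizer.lemmatize(word.lower(), pos=penn_to_wordnet_pos(tag))
        for word, tag in tagged_words
    ]
```

With a verb tag, as in `lemmatize_tagged([("generalized", "VBN")])`, the question reports that the lemmatizer returns `generalize`. A noun such as `generalization` still stays as it is, which is why the stemmer is suggested for grouping.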