Tags: python, pandas, nltk, wordnet, lemmatization

WordNetLemmatizer error - individual characters are lemmatized instead of words


I am trying to lemmatize my dataset for sentiment analysis. What should I do to get the expected output rather than the current output? The input file is a CSV, stored as a DataFrame object.

dataset = pd.read_csv('xyz.csv')

Here is my code

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
list1_ = []
for file_ in dataset:
    result1 = dataset['Content'].apply(lambda x: [lemmatizer.lemmatize(y) for y in x])
    list1_.append(result1)
dataset = pd.concat(list1_, ignore_index=True)

Expected

>> lemmatizer.lemmatize('cats')
>> ['cat']

Current output

>> lemmatizer.lemmatize('cats')
>> ['c', 'a', 't', 's']

Solution

  • TL;DR

    result1 = dataset['Content'].apply(lambda x: [lemmatizer.lemmatize(y) for y in x.split()])
    

    The lemmatizer takes any string as input.

    If the dataset['Content'] column contains strings, then iterating through each string iterates over its characters, not its "words", e.g.

    >>> from nltk.stem import WordNetLemmatizer
    >>> wnl = WordNetLemmatizer()
    >>> x = 'this is a foo bar sentence, that is of type str'
    >>> [wnl.lemmatize(ch) for ch in x]
    ['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 'f', 'o', 'o', ' ', 'b', 'a', 'r', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', ',', ' ', 't', 'h', 'a', 't', ' ', 'i', 's', ' ', 'o', 'f', ' ', 't', 'y', 'p', 'e', ' ', 's', 't', 'r']
    

    So you first have to word-tokenize your sentence string, e.g.:

    >>> from nltk import word_tokenize
    >>> [wnl.lemmatize(word) for word in x.split()]
    ['this', 'is', 'a', 'foo', 'bar', 'sentence,', 'that', 'is', 'of', 'type', 'str']
    >>> [wnl.lemmatize(word) for word in word_tokenize(x)]
    ['this', 'is', 'a', 'foo', 'bar', 'sentence', ',', 'that', 'is', 'of', 'type', 'str']
    

    Another example:

    >>> from nltk import word_tokenize
    >>> x = 'the geese ran through the parks'
    >>> [wnl.lemmatize(word) for word in x.split()]
    ['the', 'goose', 'ran', 'through', 'the', 'park']
    >>> [wnl.lemmatize(word) for word in word_tokenize(x)]
    ['the', 'goose', 'ran', 'through', 'the', 'park']
    
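    Putting it together for the question's DataFrame, here is a minimal sketch, assuming dataset['Content'] holds raw sentence strings (the file and column names are taken from the question):

    import pandas as pd
    from nltk import word_tokenize
    from nltk.stem import WordNetLemmatizer

    wnl = WordNetLemmatizer()
    dataset = pd.read_csv('xyz.csv')

    # .apply already visits every row of the Series, so the outer
    # for-loop from the question is unnecessary; tokenize each row
    # into words first, then lemmatize token by token.
    dataset['Content'] = dataset['Content'].apply(
        lambda x: [wnl.lemmatize(word) for word in word_tokenize(x)])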

    But to get more accurate lemmatization, you should word-tokenize and POS-tag the sentence first (a sketch follows below); see https://github.com/alvations/earthy/blob/master/FAQ.md#how-to-use-default-nltk-functions-in-earthy
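    For illustration, here is a hedged sketch of POS-aware lemmatization with NLTK's default pos_tag (Penn Treebank tags); the penn2morphy helper is illustrative, not an NLTK function, and the expected output assumes the default tagger's behavior:

    from nltk import word_tokenize, pos_tag
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    wnl = WordNetLemmatizer()

    def penn2morphy(penntag):
        # Map a Penn Treebank tag to a WordNet POS constant (noun by default).
        morphy = {'NN': wordnet.NOUN, 'JJ': wordnet.ADJ,
                  'VB': wordnet.VERB, 'RB': wordnet.ADV}
        return morphy.get(penntag[:2], wordnet.NOUN)

    def lemmatize_sent(text):
        # Tokenize, POS-tag, then lemmatize each token with its mapped tag.
        return [wnl.lemmatize(word, pos=penn2morphy(tag))
                for word, tag in pos_tag(word_tokenize(text))]

    print(lemmatize_sent('the geese ran through the parks'))
    # ['the', 'goose', 'run', 'through', 'the', 'park']
    # Without a POS tag, 'ran' would stay 'ran' (noun is the default).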