Python NLTK not taking out punctuation correctly


I have defined the following code:

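import string
import nltk

# Set of single punctuation characters, e.g. '"', ',', '.'
exclude = set(string.punctuation)
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()

wordList = ['"the']
# Set difference between the words and the punctuation characters, then lemmatize
answer = [lmtzr.lemmatize(word.lower()) for word in list(set(wordList) - exclude)]
print(answer)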

I printed exclude earlier, and the quotation mark " is part of it. I expected answer to be ['the']. However, when I print answer, it shows up as ['"the']. I'm not sure why the punctuation isn't being removed. Would I need to check each character individually instead?


Solution

  • When you create a set from wordList, it stores the whole string '"the' as its only element:

    >>> set(wordList)
    set(['"the'])
    

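    Set difference compares whole elements, not the characters inside them. The string '"the' is not equal to any single punctuation character, so the difference returns the same set: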

    >>> set(wordList) - set(string.punctuation)
    set(['"the'])
    

    If you just want to remove the punctuation from each word, you probably want something like:

    >>> [word.translate(None, string.punctuation) for word in wordList]
    ['the']
    

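    Here I'm using the Python 2 form of the string translate method: passing None as the first argument skips the character mapping, and the second argument lists the characters to delete. In Python 3, str.translate takes only a translation table, which can be built with str.maketrans('', '', string.punctuation).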

    You can then perform the lemmatization on the new list.
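
For anyone on Python 3, here is a minimal sketch of the same idea using only the standard library. The function names strip_punctuation and strip_punctuation_charwise and the sample words are made up for illustration and are not from the question. The second function shows the "check each character individually" approach the question asks about. Both treat only the ASCII characters in string.punctuation as punctuation.

```python
import string

# Translation table that deletes every character in string.punctuation.
# The first two arguments are empty because no characters are being remapped.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def strip_punctuation(words):
    """Return a new list with ASCII punctuation removed from each word."""
    return [word.translate(_PUNCT_TABLE) for word in words]


def strip_punctuation_charwise(words):
    """Same result, but testing each character against a set explicitly."""
    exclude = set(string.punctuation)
    return [''.join(ch for ch in word if ch not in exclude) for word in words]


# Hypothetical sample input, extending the single word from the question.
word_list = ['"the', 'dogs,', "don't"]
cleaned = strip_punctuation(word_list)
print(cleaned)  # ['the', 'dogs', 'dont']
```

Note that the apostrophe in "don't" is removed as well, since it is part of string.punctuation. The cleaned list can then be passed through lmtzr.lemmatize as in the question.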