
Stopword removal with NLTK


I am trying to process user-entered text by removing stopwords with the NLTK toolkit, but stopword removal also removes words like 'and', 'or', and 'not'. I want these words to remain after stopword removal because they are operators needed later, when the text is processed as a query. I don't know which words can act as operators in a text query, and I also want to remove unnecessary words from my text.


Solution

  • I suggest you create your own list of operator words that you take out of the stopword list. Sets can be conveniently subtracted, so:

    operators = {'and', 'or', 'not'}
    stop = set(stopwords...) - operators
    

    You can then test whether a word is in the set, and the test works whether or not your operators are part of the stopword list. Later you can switch to another stopword list or add an operator without changing this check.

    if word.lower() not in stop:
        # use word
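
Putting the pieces together, a minimal runnable sketch might look like the following. The names SAMPLE_STOPWORDS, OPERATORS, and remove_stopwords are hypothetical and not part of NLTK, and the short hard-coded list is only a stand-in so the example runs without any downloads. With NLTK installed and the stopwords corpus downloaded, nltk.corpus.stopwords.words('english') could be passed in instead.

```python
# Stand-in for a real stopword list (an assumption for illustration).
# With NLTK and its stopwords corpus available, this could instead be
# nltk.corpus.stopwords.words('english').
SAMPLE_STOPWORDS = ['i', 'am', 'a', 'an', 'the', 'for', 'and', 'or', 'not', 'is', 'to', 'of']

# Words to keep because the later query step treats them as operators.
OPERATORS = {'and', 'or', 'not'}


def remove_stopwords(text, stopword_list=SAMPLE_STOPWORDS, operators=OPERATORS):
    """Drop stopwords from text but keep the operator words."""
    # Set subtraction removes the operators from the stopword set.
    stop = set(stopword_list) - set(operators)
    # Plain whitespace split; a real tokenizer would also handle punctuation.
    return [word for word in text.split() if word.lower() not in stop]


print(remove_stopwords("I am looking for the cats and dogs or not a bird"))
# ['looking', 'cats', 'and', 'dogs', 'or', 'not', 'bird']
```

Because the operators are subtracted from the stopword set rather than checked separately, adding a new operator only means extending OPERATORS.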