python nlp spacy stop-words lemmatization

Detect stopwords after lemmatization in spaCy


How to detect if word is a stopword after stemming and lemmatization in spaCy?

Assume the sentence:

s = "something good\nsomethings 2 bad"

In this case, something is a stopword. Obviously (to me?), Something and somethings are also stopwords, but they need to be stemmed first. The following script reports True for the first, but not for the other two.

import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en')
tokenizer = Tokenizer(nlp.vocab)

s = "something good\nSomething 2 somethings"
tokens = tokenizer(s)

for token in tokens:
  print(token.lemma_, token.is_stop)

Returns:

something True
good False
"\n" False
Something False
2 False
somethings False

Is there a way to detect that through the spaCy API?


Solution

  • Stop words in spaCy are just a set of strings that set a flag on the lexemes, the context-independent entries in the vocabulary (the English stop list is defined in spacy.lang.en.stop_words). The flag simply checks whether the text is in STOP_WORDS, which is why is_stop returns True for "something" but not for "Something" or "somethings".

    However, what you can do is check if the token's lemma or lowercase form is part of the stop list, which is available via nlp.Defaults.stop_words (i.e. the defaults of the language you're using):

    def extended_is_stop(token):
        stop_words = nlp.Defaults.stop_words
        return token.is_stop or token.lower_ in stop_words or token.lemma_ in stop_words
    

    If you're using spaCy v2.0 and want to solve this even more elegantly, you could also implement your own is_stop function via a custom Token attribute extension. You can choose any name for your attribute, and it will become available via token._, for example token._.is_stop:

    import spacy
    from spacy.tokens import Token
    from spacy.lang.en.stop_words import STOP_WORDS  # import stop words from language data
    
    stop_words_getter = lambda token: token.is_stop or token.lower_ in STOP_WORDS or token.lemma_ in STOP_WORDS
    Token.set_extension('is_stop', getter=stop_words_getter)  # set attribute with getter
    
    nlp = spacy.load('en')
    doc = nlp("something Something somethings")
    assert doc[0]._.is_stop  # this was a stop word before, and still is
    assert doc[1]._.is_stop  # this is now also a stop word, because its lowercase form is
    assert doc[2]._.is_stop  # this is now also a stop word, because its lemma is
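
The logic of the check can also be tried without loading a model. The following is a minimal sketch, not spaCy code: FakeToken is a hypothetical stand-in for a spaCy Token, TOY_STOP_WORDS is a made-up three-word list, and the lemma values are assigned by hand rather than computed by a lemmatizer. Only the structure of the test mirrors extended_is_stop from the answer.

```python
from dataclasses import dataclass

# Toy stop list for illustration only; with spaCy you would use
# nlp.Defaults.stop_words instead.
TOY_STOP_WORDS = {"something", "the", "a"}


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy Token (not part of spaCy)."""
    text: str
    lemma_: str  # hand-assigned assumption; spaCy would compute this

    @property
    def lower_(self) -> str:
        return self.text.lower()

    @property
    def is_stop(self) -> bool:
        # Exact string match only, as the answer describes for the built-in flag.
        return self.text in TOY_STOP_WORDS


def extended_is_stop(token, stop_words=TOY_STOP_WORDS):
    """Return True if the token, its lowercase form, or its lemma is a stop word."""
    return token.is_stop or token.lower_ in stop_words or token.lemma_ in stop_words


tokens = [
    FakeToken("something", "something"),
    FakeToken("Something", "something"),
    FakeToken("somethings", "something"),
    FakeToken("good", "good"),
]

# Map each token text to (plain flag, extended check) for comparison.
results = {t.text: (t.is_stop, extended_is_stop(t)) for t in tokens}
for text, (plain, extended) in results.items():
    print(f"{text:<12} is_stop={plain!s:<6} extended={extended}")
```

Because extended_is_stop only reads is_stop, lower_ and lemma_, the same function body should work on real spaCy tokens once a pipeline that assigns lemmas has processed the text.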