How to detect if a word is a stopword after stemming and lemmatization in spaCy
Assume the sentence
s = "something good\nsomethings 2 bad"
In this case, something is a stopword. Obviously (to me?), Something and somethings are also stopwords, but they need to be stemmed first. The following script reports the first as a stop word, but not the other two.
import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en')
tokenizer = Tokenizer(nlp.vocab)
s = "something good\nSomething 2 somethings"
tokens = tokenizer(s)
for token in tokens:
    print(token.lemma_, token.is_stop)
something True
good False
"\n" False
Something False
2 False
somethings False
Is there a way to detect that through spaCy?
Stop words in spaCy are just a set of strings that set a flag on the lexemes, the context-independent entries in the vocabulary (the English stop list is defined in spacy.lang.en.stop_words). The flag simply checks whether the text is in STOP_WORDS, which is why "something" returns True for is_stop and "somethings" doesn't.
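To make the exact-match behaviour concrete, here is a minimal sketch in plain Python that does not use spaCy at all. The three-word STOP_WORDS set and the exact_is_stop helper are hypothetical stand-ins for illustration, not spaCy's actual list or implementation.

```python
# Hypothetical three-word subset standing in for the full English stop list.
STOP_WORDS = {"something", "the", "and"}


def exact_is_stop(text):
    """Mimic a flag that only checks literal membership in the stop list."""
    return text in STOP_WORDS


# Only the exact lowercase string matches; the capitalized and plural
# forms are different strings, so the membership test fails for them.
for word in ["something", "Something", "somethings"]:
    print(word, exact_is_stop(word))
```

Because the check is a plain string lookup, any normalization (lowercasing, lemmatizing) has to happen before the lookup.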
However, what you can do is check whether the token's lemma or lowercase form is part of the stop list, which is available via nlp.Defaults.stop_words (i.e. the defaults of the language you're using):
def extended_is_stop(token):
    stop_words = nlp.Defaults.stop_words
    return token.is_stop or token.lower_ in stop_words or token.lemma_ in stop_words
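As a rough illustration of the same check without a global nlp object, the sketch below passes the stop list in as an argument. The name is_stop_extended, the one-word stop list, and the SimpleNamespace stand-in tokens (with hand-written lower_ and lemma_ values) are all assumptions made for the example; with real spaCy tokens, the lemma depends on the model you load.

```python
from types import SimpleNamespace


def is_stop_extended(token, stop_words):
    """True if the token, its lowercase form, or its lemma is a stop word."""
    return bool(
        token.is_stop
        or token.lower_ in stop_words
        or token.lemma_ in stop_words
    )


# Hypothetical one-word stop list, used only for this demonstration.
stop_words = {"something"}

# Stand-in tokens exposing the attribute names used above; the lemma values
# are written by hand here, not produced by a lemmatizer.
tokens = [
    SimpleNamespace(text="something", lower_="something", lemma_="something", is_stop=True),
    SimpleNamespace(text="Something", lower_="something", lemma_="something", is_stop=False),
    SimpleNamespace(text="somethings", lower_="somethings", lemma_="something", is_stop=False),
    SimpleNamespace(text="good", lower_="good", lemma_="good", is_stop=False),
]

flags = {t.text: is_stop_extended(t, stop_words) for t in tokens}
print(flags)
```

Passing the stop list explicitly keeps the function independent of whichever pipeline object happens to be in scope, which also makes it easier to test.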
If you're using spaCy v2.0 and want to solve this even more elegantly, you could also implement your own is_stop function via a custom Token attribute extension. You can choose any name for your attribute, and it will become available via token._ (for example, token._.is_stop):
import spacy
from spacy.tokens import Token
from spacy.lang.en.stop_words import STOP_WORDS # import stop words from language data
stop_words_getter = lambda token: token.is_stop or token.lower_ in STOP_WORDS or token.lemma_ in STOP_WORDS
Token.set_extension('is_stop', getter=stop_words_getter) # set attribute with getter
nlp = spacy.load('en')
doc = nlp("something Something somethings")
assert doc[0]._.is_stop # this was a stop word before, and still is
assert doc[1]._.is_stop # this is now also a stop word, because its lowercase form is
assert doc[2]._.is_stop # this is now also a stop word, because its lemma is