Tags: nlp, spacy, stop-words

Mismatch in the count of stop-words in `Defaults.stop_words` and the ones derived from `nlp.vocab`?


Suppose we have `nlp = spacy.load('en_core_web_sm')`. Typing `len(nlp.Defaults.stop_words)` returns 326, but when I run the following code (which essentially counts the stop words in the vocabulary), I get 111:

```python
import spacy

nlp = spacy.load('en_core_web_sm')

# Count the entries of nlp.vocab that are flagged as stop words
i = 0
for word in nlp.vocab:
    if word.is_stop:
        print(word.text)
        i += 1
print(i)
```

Given that (presumably) both `nlp.Defaults.stop_words` and `nlp.vocab` work with the same underlying vocabulary loaded by `nlp = spacy.load('en_core_web_sm')`, I don't understand why the numbers don't match. Any thoughts?


Solution

  • The actual default list of stop words used to decide whether a token `is_stop` is `nlp.Defaults.stop_words`, and that list contains 326 entries.
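    As a quick check of this relationship (a sketch using `spacy.blank("en")` so no model download is needed; the exact list size may vary between spaCy versions), `is_stop` agrees with membership in `nlp.Defaults.stop_words`:

    ```python
    import spacy

    # A blank English pipeline still carries the language defaults,
    # including the stop-word list.
    nlp = spacy.blank("en")

    print(len(nlp.Defaults.stop_words))       # 326 on recent spaCy versions
    print("the" in nlp.Defaults.stop_words)   # membership in the default list
    print(nlp.vocab["the"].is_stop)           # the lexeme flag agrees with it
    ```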

    The mismatch comes from the fact that `nlp.vocab` is a `Vocab` containing `Lexeme`s (word types) that are cached for efficiency, and it gains new entries as you process documents. When you initialize the `Language` object (`nlp`), the `Vocab` contains only a certain number of default entries (764 in my case), and you will see this number grow as you process documents containing words not already present in it.

    So the loop in the example is only checking which of these default `Vocab` entries appear in the 326-word stop-word list, which is exactly `nlp.Defaults.stop_words`. That is why it reports far fewer than 326.
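    The idea can be illustrated with a toy model in plain Python (not real spaCy: `STOP_WORDS`, `vocab`, and `process` here are made-up stand-ins): the stop-word list is fixed, while the vocab only ever contains words that have actually been seen, so counting stop words *via the vocab* undercounts.

    ```python
    # Fixed stop-word list (stands in for nlp.Defaults.stop_words)
    STOP_WORDS = {"the", "a", "an", "and", "is", "of", "to", "in"}

    # Initial default entries (stands in for the freshly loaded nlp.vocab)
    vocab = {"cat", "the", "runs"}

    def process(doc_words):
        """Processing a document adds its word types to the vocab."""
        vocab.update(doc_words)

    process(["a", "dog", "is", "in", "the", "garden"])

    # Counting stop words via the vocab only finds those already seen:
    seen_stop_words = sum(1 for w in vocab if w in STOP_WORDS)
    print(seen_stop_words, "<", len(STOP_WORDS))  # 4 < 8
    ```

    Iterating over more text would add more stop words to the vocab, so the loop's count creeps toward 326 as more documents are processed, but it never measures the list itself.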