Finding non-existing words with spaCy?


I am new to spaCy. I have a (German) text in which I want to find all the words that are not in the dictionary (using the de_core_news_lg pipeline). Reading spaCy's documentation, the only thing I found that looked promising was Token.has_vector. When I check all the tokens in the Doc object returned by nlp(TEXT), I find that the tokens for which has_vector is False do seem to be either typos or rare words unlikely to be in the dictionary.

So my hypothesis is that Token.has_vector being False is equivalent to the word not having been found in the dictionary. Am I correct? Is there a better way to find words that are not in the dictionary?


Solution

  • spaCy does not include functionality for checking if a word is in the dictionary or not.

    If you've loaded a pipeline with vectors, you can use has_vector to check whether a word vector is present for a given token. This is roughly similar to checking whether a word is in the dictionary, but the result depends entirely on the vectors. For most languages the vectors simply include any word that appeared at least a certain number of times in a training corpus, so common typos and other odd strings will have vectors, while some legitimate words may happen to be missing.
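
As a rough illustration of the has_vector check described above, here is a minimal sketch. The helper names tokens_without_vectors and find_in_text are made up for this example and are not part of spaCy, and the sketch assumes the de_core_news_lg pipeline is installed. Restricting the check to alphabetic tokens via is_alpha is also an assumption, made so that punctuation and numbers are not reported.

```python
from typing import Iterable, List


def tokens_without_vectors(doc: Iterable) -> List[str]:
    """Return the text of alphabetic tokens that have no word vector.

    `doc` is expected to be a spaCy Doc, but anything that yields objects
    with `.text`, `.is_alpha` and `.has_vector` attributes will work.
    """
    return [tok.text for tok in doc if tok.is_alpha and not tok.has_vector]


def find_in_text(text: str, model: str = "de_core_news_lg") -> List[str]:
    """Hypothetical convenience wrapper: load a pipeline and run the check."""
    import spacy  # imported lazily so the helper above works without spaCy

    nlp = spacy.load(model)  # assumes the pipeline has been downloaded
    return tokens_without_vectors(nlp(text))
```

Calling find_in_text(TEXT) returns the tokens that lack a vector. As noted above, this only approximates a dictionary lookup: frequent typos may have vectors and rare but valid words may not. spaCy also exposes Token.is_oov, which reports the same thing (the token has no word vector).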

    If you want to detect "real" words in some way, it's best to source your own word list.
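
A minimal sketch of the "own list" approach, using only the standard library. It assumes the word list is available as one word per line; the function names load_wordlist and unknown_words are hypothetical, and the case-insensitive comparison is a simplifying assumption that ignores German noun capitalization and inflected forms.

```python
import re
from typing import Iterable, List, Set


def load_wordlist(lines: Iterable[str]) -> Set[str]:
    """Build a case-folded set of known words from one-word-per-line input."""
    return {line.strip().casefold() for line in lines if line.strip()}


def unknown_words(text: str, wordlist: Set[str]) -> List[str]:
    """Return the words in `text` that do not appear in `wordlist`."""
    # [^\W\d_]+ matches runs of Unicode letters, including umlauts and ß.
    words = re.findall(r"[^\W\d_]+", text)
    return [w for w in words if w.casefold() not in wordlist]
```

With a real list you would pass an open file to load_wordlist, for example `load_wordlist(open("wordlist.txt", encoding="utf-8"))`, where the file name is a placeholder. Because a plain list contains only the forms it lists, combining it with spaCy's lemmatizer (looking up token.lemma_ instead of the surface form) may reduce false positives for inflected words.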