I'm building a custom n-gram vectorizer for a bag-of-words model. I'm curious: what should I do if, while vectorizing a short text, I find a new token that does not exist in the corpus vocabulary? Should it just be skipped, or is there something better to do?
You can either skip it or add a special token to the vocabulary for unknown words, e.g. every previously unseen word is replaced with "UNK" and then counted just the same as any other word. Also, to deal with the problem of not having any UNKs in the training data, you can replace all words that occur only once in the corpus with UNK.
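Here is a minimal sketch of both options, in case it helps. It assumes whitespace tokenization and plain unigrams to keep things short; the same idea should carry over to n-grams. The names `build_vocab`, `vectorize`, `min_count` and `skip_unknown` are made up for illustration and do not come from any library.

```python
from collections import Counter

UNK = "UNK"  # placeholder for unknown words (the name is just a convention)

def build_vocab(corpus, min_count=2):
    """Map each kept word to a column index; column 0 is reserved for UNK.

    Words seen fewer than min_count times are left out of the vocabulary,
    so they fall into the UNK column when the training texts are vectorized.
    """
    counts = Counter(tok for doc in corpus for tok in doc.split())
    kept = sorted(t for t, c in counts.items() if c >= min_count and t != UNK)
    vocab = {UNK: 0}
    for tok in kept:
        vocab[tok] = len(vocab)
    return vocab

def vectorize(text, vocab, skip_unknown=False):
    """Return a count vector; unknown words are skipped or counted as UNK."""
    vec = [0] * len(vocab)
    for tok in text.split():
        if tok in vocab:
            vec[vocab[tok]] += 1
        elif not skip_unknown:
            vec[vocab[UNK]] += 1
    return vec

corpus = ["the cat sat", "the cat ran", "a dog ran"]
vocab = build_vocab(corpus)

# "bird" and "fast" were never seen, so they land in the UNK column.
with_unk = vectorize("the bird ran fast", vocab)
# Same text, but unknown words are simply dropped.
skipped = vectorize("the bird ran fast", vocab, skip_unknown=True)
# A training text: "sat" occurs only once in the corpus, so it counts as UNK.
train_vec = vectorize("the cat sat", vocab)
```

Because the rare training words are mapped to UNK, the UNK column is non-zero in the training vectors too, so a model trained on them has something to learn from for that feature.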