Tags: pandas, machine-learning, text, nlp, oov

How to handle out of vocab words with bag of words


I am attempting to use BoW before ML on my text-based dataset, but I do not want my training set to influence my test set (i.e., data leakage), so I want to apply BoW to the train set before (and independently of) the test set. The problem is that the test set then has different features (i.e., words) than the train set, so the matrices are not the same size. I tried keeping only the columns in the test set that also appear in the train set, but 1) my code is not right and 2) I do not think this is the most efficient procedure. I think I also need code to add filler columns? (A sketch of that alignment follows the snippet below.) Here is what I have:

import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def bow (tokens, data):
    tokens = tokens.apply(nltk.word_tokenize)
    cvec = CountVectorizer(min_df = .01, max_df = .99, ngram_range=(1,2), tokenizer=lambda doc:doc, lowercase=False)
    cvec.fit(tokens)
    cvec_counts = cvec.transform(tokens)
    cvec_counts_bow = cvec_counts.toarray()
    vocab = cvec.get_feature_names()
    bow_model = pd.DataFrame(cvec_counts_bow, columns=vocab)
    return bow_model

X_train = bow(train['text'], train)
X_test = bow(test['text'], test)

vocab = list(X_train.columns)
X_test = test.filter.columns([w for w in X_test if w in vocab])
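
For reference, the column alignment described above (keep only the columns shared with the train set and add zero-filled "filler" columns for train words missing from the test set) can be written with pandas' reindex; this is just a sketch, assuming X_train and X_test are the DataFrames returned by the bow function above:

# Align the test matrix to the train vocabulary: test-only columns are dropped,
# train-only columns are added and filled with zeros, so both matrices match.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

The solution below makes this step unnecessary by reusing the vectorizer that was fitted on the train set.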


Solution

  • You would normally fit the CountVectorizer only on the train set and reuse the same fitted vectorizer on the test set, e.g.:

    import nltk
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    
    def bow(tokens, data, cvec=None):
        tokens = tokens.apply(nltk.word_tokenize)
        # Fit a new vectorizer only when none is passed in (i.e. for the train set);
        # the test set reuses the vectorizer that was fitted on the train set.
        if cvec is None:
            cvec = CountVectorizer(min_df=.01, max_df=.99, ngram_range=(1, 2),
                                   tokenizer=lambda doc: doc, lowercase=False)
            cvec.fit(tokens)
        cvec_counts = cvec.transform(tokens)
        cvec_counts_bow = cvec_counts.toarray()
        vocab = cvec.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
        bow_model = pd.DataFrame(cvec_counts_bow, columns=vocab)
        return bow_model, cvec
    
    X_train, cvec = bow(train['text'], train)
    X_test, cvec = bow(test['text'], test, cvec=cvec)
    
    # No column filtering is needed: because the test set is transformed with the
    # vectorizer fitted on the train set, X_train and X_test already have the
    # same columns, and unseen test-set words are simply ignored.
    

    This will, of course, ignore words that were not seen in the train set, but that should not be a problem: the train and test sets should follow roughly the same distribution, so unknown words should be rare (a short demonstration follows the note below).

    Note: Code is not tested
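
    As a quick illustration of that out-of-vocabulary behaviour, here is a minimal, self-contained sketch with made-up toy documents (not the question's data): the vectorizer is fitted only on the train documents, and transforming a test document that contains an unseen word ("meowed") still produces a matrix with exactly the train vocabulary as columns. It uses get_feature_names_out(), which requires scikit-learn >= 1.0; older versions use get_feature_names().
    
    from sklearn.feature_extraction.text import CountVectorizer
    import pandas as pd
    
    train_docs = ["the cat sat", "the dog barked"]   # toy train set
    test_docs = ["the cat meowed"]                   # "meowed" never appears in train
    
    cvec = CountVectorizer()
    X_train = pd.DataFrame(cvec.fit_transform(train_docs).toarray(),
                           columns=cvec.get_feature_names_out())  # fit only on train
    X_test = pd.DataFrame(cvec.transform(test_docs).toarray(),
                          columns=cvec.get_feature_names_out())   # reuse train vocabulary
    
    print(list(X_train.columns) == list(X_test.columns))  # True: identical feature columns
    print(X_test)  # "meowed" is silently dropped; only train-vocabulary words are counted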