I am attempting to use a bag-of-words (BoW) representation before applying ML to my text-based dataset, and I do not want information from my test set to leak into my training features (i.e., data leakage). So I want to apply BoW to the train set first and only then to the test set. But then my test set has different features (i.e., words) than my train set, so the matrices are not the same size. I tried keeping only the columns in the test set that also appear in the train set, but 1) my code is not right and 2) I do not think this is the most efficient procedure. I think I also need code to add filler columns? Here is what I have:
import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def bow(tokens, data):
    tokens = tokens.apply(nltk.word_tokenize)
    cvec = CountVectorizer(min_df=.01, max_df=.99, ngram_range=(1, 2), tokenizer=lambda doc: doc, lowercase=False)
    cvec.fit(tokens)
    cvec_counts = cvec.transform(tokens)
    cvec_counts_bow = cvec_counts.toarray()
    vocab = cvec.get_feature_names()
    bow_model = pd.DataFrame(cvec_counts_bow, columns=vocab)
    return bow_model
X_train = bow(train['text'], train)
X_test = bow(test['text'], test)
vocab = list(X_train.columns)
X_test = test.filter.columns([w for w in X_test if w in vocab])
You would normally fit the CountVectorizer only on the train set and reuse the same fitted vectorizer on the test set, e.g.:
import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def bow(tokens, data, cvec=None):
    tokens = tokens.apply(nltk.word_tokenize)
    if cvec is None:
        # Create and fit the vectorizer only when none is passed in,
        # i.e. on the training set.
        cvec = CountVectorizer(min_df=.01, max_df=.99, ngram_range=(1, 2), tokenizer=lambda doc: doc, lowercase=False)
        cvec.fit(tokens)
    cvec_counts = cvec.transform(tokens)
    cvec_counts_bow = cvec_counts.toarray()
    vocab = cvec.get_feature_names()  # on newer scikit-learn versions, use get_feature_names_out()
    bow_model = pd.DataFrame(cvec_counts_bow, columns=vocab)
    return bow_model, cvec

X_train, cvec = bow(train['text'], train)           # fits the vectorizer on the train set
X_test, cvec = bow(test['text'], test, cvec=cvec)    # reuses the fitted vectorizer on the test set
The column-filtering step from your code is no longer needed: because the same fitted vectorizer transforms both sets, X_train and X_test end up with exactly the same columns.
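If you ever did need to align two frames built with different vocabularies (the "filler columns" you mention), pandas reindex can do it. This is only a fallback sketch, not part of the approach above; X_train and X_test here stand for the DataFrames produced by the function:

# Fallback only: align X_test to the training columns.
# Columns unseen in training are dropped, missing ones are added and filled with 0.
X_test_aligned = X_test.reindex(columns=X_train.columns, fill_value=0)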
This will of course ignore words that were not seen in the train set, but this shouldn't be a problem: train and test should have more or less the same distribution, so unknown words should be rare.
Note: Code is not tested
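To see this behaviour concretely, here is a minimal self-contained sketch with a made-up toy corpus (no min_df/max_df filtering, recent scikit-learn assumed for get_feature_names_out); the unseen word in the test sentence is simply dropped at transform time:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

train_docs = ["the cat sat", "the dog sat"]
test_docs = ["the cat barked"]   # "barked" never appears in the training docs

cvec = CountVectorizer()
X_train = pd.DataFrame(cvec.fit_transform(train_docs).toarray(),
                       columns=cvec.get_feature_names_out())
X_test = pd.DataFrame(cvec.transform(test_docs).toarray(),
                      columns=cvec.get_feature_names_out())

print(list(X_train.columns))   # ['cat', 'dog', 'sat', 'the']
print(list(X_test.columns))    # same columns: the vocabulary is fixed by the training fit
print(X_test.values)           # [[1 0 0 1]] -- "barked" is ignored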