I am trying to train a text classifier using sklearn's CountVectorizer. The problem is that many tokens in my training documents are document-specific. Regular English words work perfectly well with the CountVectorizer.fit_transform method, but some tokens are formatted so that they match the regex '\w\d\d\w\w\d', such as 'd84ke2'. As it stands, fit_transform takes 'd84ke2' at face value and uses it as a feature.
I want all the tokens that match that specific regex to be condensed into a single feature, while leaving the regular English words as their own features, since a feature like 'd84ke2' is useless: it will never come up again in any other document.
I've yet to find a way to do this, much less the "best" way. Below is an example of the code I have, where you can see that the tokens 'j64ke2', 'r32kl4', 'w35kf9', and 'e93mf9' all become their own features. I repeat for clarity: I want to condense those features into one and keep the others as they are.
docs = ['the quick brown j64ke2 jumped over the lazy dogs r32kl4.',
'an apple a day keeps the w35kf9 away',
'you got the lions share of the e93mf9']
import numpy as np
import pandas as pd
from sklearn.utils import Bunch
from sklearn.feature_extraction.text import CountVectorizer

# Define target and target_names.
target_names = ['zero', 'one', 'two']
target = np.array([0, 1, 2])

# Create message bunch.
doc_info = Bunch(data=docs, target=target, target_names=target_names)

# Vectorize training data.
count_vect = CountVectorizer()
count_vect.fit(doc_info.data)
X_train_counts = count_vect.transform(doc_info.data)
X = X_train_counts.toarray()

# vocabulary_ maps each token to its column index, and its keys are not
# stored in index order, so sort the tokens by index to get column labels
# that actually line up with the columns of X.
vocab = count_vect.vocabulary_
vocab_keys = sorted(vocab, key=vocab.get)
df = pd.DataFrame(X, columns=vocab_keys)
yatu's comment is a good solution. I was able to clean the documents before feeding them to CountVectorizer by substituting a placeholder word for every token that matched the regex.
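A minimal sketch of that preprocessing, assuming the regex from the question and a made-up placeholder token 'wordcode' (any string that cannot collide with the real vocabulary works). The cleaning function can either be applied up front or passed to CountVectorizer through its preprocessor parameter:

import re
from sklearn.feature_extraction.text import CountVectorizer

# Regex from the question; \b anchors keep it from matching inside longer tokens.
CODE_RE = re.compile(r'\b\w\d\d\w\w\d\b')

def clean(doc):
    # Replace every document-specific code with one shared placeholder so
    # they all collapse into a single feature. Lowercasing is done here
    # because a custom preprocessor replaces CountVectorizer's default
    # lowercasing step.
    return CODE_RE.sub('wordcode', doc.lower())

docs = ['the quick brown j64ke2 jumped over the lazy dogs r32kl4.',
        'an apple a day keeps the w35kf9 away',
        'you got the lions share of the e93mf9']

# Option 1: clean the documents before vectorizing.
count_vect = CountVectorizer()
X = count_vect.fit_transform([clean(d) for d in docs])

# Option 2: let CountVectorizer apply the cleaner itself.
count_vect = CountVectorizer(preprocessor=clean)
X = count_vect.fit_transform(docs)

print(sorted(count_vect.vocabulary_))  # 'wordcode' appears once; no 'j64ke2' etc.

Passing the function as preprocessor has the advantage that the same cleaning is applied automatically at transform time, for example when vectorizing test documents later.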