Tags: python, scikit-learn, nlp, countvectorizer, python-re

Vectorize document based on vocabulary AND regex


I am trying to train a text classifier using sklearn's CountVectorizer. The problem is that my training documents contain many document-specific tokens. CountVectorizer.fit_transform works perfectly well on regular English words, but some tokens follow a fixed format that matches the regex '\w\d\d\w\w\d', such as 'd84ke2'. As it stands, fit_transform takes 'd84ke2' at face value and uses it as a feature.

I want every token that matches that regex to be counted under a single shared feature, while the regular English words remain their own features. A feature such as 'd84ke2' on its own is useless, since it will not come up again in any other document.

I have yet to find a way to do this, much less the "best" way. Below is the code I have, where the tokens 'j64ke2', 'r32kl4', 'w35kf9', and 'e93mf9' each become their own feature. To repeat for clarity: I want to condense these features into one and keep the others.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils import Bunch

docs = ['the quick brown j64ke2 jumped over the lazy dogs r32kl4.',
        'an apple a day keeps the w35kf9 away',
        'you got the lions share of the e93mf9']

# Define target and target_names
target_names = ['zero', 'one', 'two']
target = np.array([0, 1, 2])

# Create message bunch
doc_info = Bunch(data=docs, target=target, target_names=target_names)

# Vectorize training data
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(doc_info.data)

# vocabulary_ maps each token to its column index, and its key order is not
# the column order. Sort the tokens by index so the DataFrame column labels
# line up with the columns of the count matrix.
vocab = count_vect.vocabulary_
feature_names = sorted(vocab, key=vocab.get)

X = X_train_counts.toarray()
df = pd.DataFrame(X, columns=feature_names)

Solution

  • yatu's comment is a good solution. I was able to clean each document before feeding it to CountVectorizer by substituting a single placeholder word for every token that matched the regex.
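
A minimal sketch of that cleaning step, assuming the documents from the question. The names CODE_RE, PLACEHOLDER, and replace_codes are made up for illustration, and the letter slots use [a-z] rather than \w (an assumption, so that all-digit tokens are not swallowed by the pattern):

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

docs = ['the quick brown j64ke2 jumped over the lazy dogs r32kl4.',
        'an apple a day keeps the w35kf9 away',
        'you got the lions share of the e93mf9']

# Pattern from the question, wrapped in word boundaries so it only matches
# whole tokens. [a-z] replaces \w in the letter slots (an assumption).
CODE_RE = re.compile(r'\b[a-z]\d\d[a-z][a-z]\d\b', flags=re.IGNORECASE)

# Hypothetical placeholder; any word that cannot occur naturally would do.
PLACEHOLDER = 'codetoken'


def replace_codes(text):
    """Replace every token matching CODE_RE with the shared placeholder."""
    return CODE_RE.sub(PLACEHOLDER, text)


# Clean the documents first, then vectorize as before.
cleaned_docs = [replace_codes(doc) for doc in docs]

count_vect = CountVectorizer()
X = count_vect.fit_transform(cleaned_docs)
vocab = count_vect.vocabulary_

# All matching tokens now share one column.
code_counts = X[:, vocab[PLACEHOLDER]].toarray().ravel().tolist()
print(code_counts)
```

The same function has to be applied to any new documents before calling count_vect.transform on them. It could also be passed as CountVectorizer's preprocessor argument, but that replaces the built-in preprocessing (including lowercasing), so the function would then need to lowercase the text itself.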