Tags: python, scikit-learn, nlp, naive-bayes, countvectorizer

Python sklearn: using more than just the count features for naive Bayes learning


First of all, I am new to Python and NLP / machine learning. Right now I have the following code:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = CountVectorizer(
    input="content",
    decode_error="ignore",
    strip_accents=None,
    stop_words=stopwords.words('english'),
    tokenizer=myTokenizer  # custom tokenizer defined elsewhere
)
counts = vectorizer.fit_transform(data['message'].values)
classifier = MultinomialNB()
targets = data['sentiment'].values
classifier.fit(counts, targets)

Now this actually works pretty well: I get a sparse matrix from the CountVectorizer, and the classifier is fit on that matrix together with the targets (0, 2, 4).

However, what would I have to do if I wanted to use more features in the vector, in addition to the word counts? I can't seem to find that out. Thank you in advance.


Solution

  • In your case counts is a sparse matrix; you can add columns to it with extra features:

    import numpy as np
    from scipy import sparse as sp

    counts = vectorizer.fit_transform(data['message'].values)

    # dummy extra feature: a constant column of ones, one row per sample
    ones = np.ones(shape=(len(data), 1))

    # append the extra column to the sparse count matrix
    X = sp.hstack([counts, ones])

    classifier.fit(X, targets)
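
    For example (a hypothetical extra feature, not from the original question), the constant column could be replaced with the character length of each message; note that MultinomialNB expects non-negative feature values, so any extra columns should be non-negative too:

    # hypothetical extra feature: character length of each message
    lengths = np.array([len(m) for m in data['message'].values]).reshape(-1, 1)

    X = sp.hstack([counts, lengths]).tocsr()  # CSR is convenient for fitting
    classifier.fit(X, targets)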
    

    scikit-learn also provides a built-in helper for that; it is called FeatureUnion. There is an example of combining features from two transformers in the scikit-learn docs:

    from sklearn.pipeline import FeatureUnion
    from sklearn.decomposition import PCA, KernelPCA

    estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
    combined = FeatureUnion(estimators)

    # then you can do this:
    X = combined.fit_transform(my_data)
    

    FeatureUnion does almost the same thing: it takes a list of (name, transformer) pairs, calls them all on the same input data, and then concatenates the results column-wise.
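
    Applied to your case, a minimal sketch might look like the following (MessageLength is a hypothetical example transformer, not a scikit-learn class):

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import FeatureUnion
    import numpy as np

    class MessageLength(BaseEstimator, TransformerMixin):
        # hypothetical transformer: emits one column with each message's character length
        def fit(self, X, y=None):
            return self

        def transform(self, X):
            return np.array([len(text) for text in X]).reshape(-1, 1)

    combined = FeatureUnion([
        ('counts', vectorizer),       # the CountVectorizer from the question
        ('length', MessageLength()),
    ])
    X = combined.fit_transform(data['message'].values)
    classifier.fit(X, targets)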

    It is usually better to use FeatureUnion because you will have an easier time using scikit-learn cross-validation, pickling the final pipeline, and so on.
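
    For example, here is a minimal sketch (assuming a scikit-learn version that provides sklearn.model_selection) of wrapping the combined features and the classifier in a single Pipeline, which can then be cross-validated or pickled as one object:

    from sklearn.pipeline import Pipeline
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import cross_val_score

    pipeline = Pipeline([
        ('features', combined),         # the FeatureUnion from above
        ('classifier', MultinomialNB()),
    ])

    # each fold re-fits the vectorizers on the training split only
    scores = cross_val_score(pipeline, data['message'].values, targets, cv=5)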
