First of all, I am new to Python and NLP / machine learning. Right now I have the following code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from nltk.corpus import stopwords

vectorizer = CountVectorizer(
    input="content",
    decode_error="ignore",
    strip_accents=None,
    stop_words=stopwords.words('english'),
    tokenizer=myTokenizer
)
counts = vectorizer.fit_transform(data['message'].values)
classifier = MultinomialNB()
targets = data['sentiment'].values
classifier.fit(counts, targets)
Now this actually works pretty well: I get a sparse matrix from the CountVectorizer, and the classifier makes use of that matrix as well as the targets (0, 2, 4).
However, what would I have to do if I wanted to use more features in the vector besides just the word counts? I can't seem to find out how. Thank you in advance.
In your case counts is a sparse matrix; you can add columns with extra features to it:
import numpy as np
from scipy import sparse as sp

counts = vectorizer.fit_transform(data['message'].values)

# dummy extra feature: a column of ones, one row per document
ones = np.ones(shape=(len(data), 1))

# append the new column to the sparse count matrix, column-wise
X = sp.hstack([counts, ones])
classifier.fit(X, targets)
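Instead of a constant column you could stack a real extra feature in the same way, e.g. the length of each message (just a sketch, assuming data['message'] is the same pandas column as in your code):

# hypothetical extra feature: number of characters per message
lengths = data['message'].str.len().values.reshape(-1, 1)

X = sp.hstack([counts, lengths])
classifier.fit(X, targets)

MultinomialNB expects non-negative feature values, which character counts satisfy.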
scikit-learn also provides a built-in helper for that; it is called FeatureUnion. There is an example of combining features from two transformers in the scikit-learn docs:
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA, KernelPCA

estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
combined = FeatureUnion(estimators)
# then you can do this:
X = combined.fit_transform(my_data)
FeatureUnion does almost the same thing: it takes a list of named transformers, calls them all on the same input data, then concatenates the results column-wise.
It is usually better to use FeatureUnion because you will have an easier time with scikit-learn cross-validation, pickling the final pipeline, etc.
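Applied to your case, a sketch could look like the following (message_lengths is just an illustrative extra feature, and for brevity I only pass your myTokenizer to CountVectorizer; you would keep the other arguments from your question):

import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def message_lengths(messages):
    # one column with the number of characters per message (illustrative feature)
    return np.array([len(m) for m in messages]).reshape(-1, 1)

features = FeatureUnion([
    ('counts', CountVectorizer(tokenizer=myTokenizer)),
    ('length', FunctionTransformer(message_lengths, validate=False)),
])

pipeline = Pipeline([
    ('features', features),
    ('classifier', MultinomialNB()),
])

pipeline.fit(data['message'].values, data['sentiment'].values)

The nice part is that the whole thing is now a single estimator, so you can cross-validate or pickle the pipeline directly.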
See also these tutorials: