Tags: python, scikit-learn, nlp, naive-bayes, countvectorizer

Python sklearn: using more than just the count features for naive Bayes learning


First of all, I am new to Python and NLP / machine learning. Right now I have the following code:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = CountVectorizer(
    input="content",
    decode_error="ignore",
    strip_accents=None,
    stop_words=stopwords.words('english'),
    tokenizer=myTokenizer  # custom tokenizer defined elsewhere
)
counts = vectorizer.fit_transform(data['message'].values)
classifier = MultinomialNB()
targets = data['sentiment'].values
classifier.fit(counts, targets)

Now this actually works pretty well: I get a sparse matrix from the CountVectorizer, and the classifier is fit on that matrix together with the targets (0, 2, 4).

However, what would I have to do if I wanted to use more features in the vector, in addition to the word counts? I can't seem to find that out. Thank you in advance.


Solution

  • In your case counts is a sparse matrix; you can add columns to it with extra features:

    import numpy as np
    from scipy import sparse as sp

    counts = vectorizer.fit_transform(data['message'].values)

    # dummy extra feature: a constant column of ones, one row per sample
    ones = np.ones(shape=(len(data), 1))

    # append the extra column to the sparse count matrix
    X = sp.hstack([counts, ones])

    classifier.fit(X, targets)
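
    For example (a hypothetical extra feature, not from the original question), the constant column could be replaced with the character length of each message; note that MultinomialNB expects non-negative feature values, so any extra columns should be non-negative too:

    # hypothetical extra feature: character length of each message
    lengths = np.array([len(m) for m in data['message'].values]).reshape(-1, 1)

    X = sp.hstack([counts, lengths]).tocsr()  # CSR is convenient for fitting
    classifier.fit(X, targets)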
    

    scikit-learn also provides a built-in helper for that; it is called FeatureUnion. There is an example of combining features from two transformers in the scikit-learn docs:

    from sklearn.pipeline import FeatureUnion
    from sklearn.decomposition import PCA, KernelPCA

    estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
    combined = FeatureUnion(estimators)

    # then you can do this:
    X = combined.fit_transform(my_data)
    

    FeatureUnion does almost the same thing: it takes a list of (name, transformer) pairs, calls them all on the same input data, and then concatenates the results column-wise.
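
    Applied to your case, a minimal sketch might look like the following (MessageLength is a hypothetical example transformer, not a scikit-learn class):

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import FeatureUnion
    import numpy as np

    class MessageLength(BaseEstimator, TransformerMixin):
        # hypothetical transformer: emits one column with each message's character length
        def fit(self, X, y=None):
            return self

        def transform(self, X):
            return np.array([len(text) for text in X]).reshape(-1, 1)

    combined = FeatureUnion([
        ('counts', vectorizer),       # the CountVectorizer from the question
        ('length', MessageLength()),
    ])
    X = combined.fit_transform(data['message'].values)
    classifier.fit(X, targets)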

    It is usually better to use FeatureUnion because you will have an easier time using scikit-learn cross-validation, pickling the final pipeline, and so on.
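
    For example, here is a minimal sketch (assuming a scikit-learn version that provides sklearn.model_selection) of wrapping the combined features and the classifier in a single Pipeline, which can then be cross-validated or pickled as one object:

    from sklearn.pipeline import Pipeline
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import cross_val_score

    pipeline = Pipeline([
        ('features', combined),         # the FeatureUnion from above
        ('classifier', MultinomialNB()),
    ])

    # each fold re-fits the vectorizers on the training split only
    scores = cross_val_score(pipeline, data['message'].values, targets, cv=5)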
