
MultinomialNB - Theory vs practice


OK, so I'm studying Andrew Ng's Machine Learning course. I'm currently reading this chapter and want to try the Multinomial Naive Bayes model (bottom of page 12) for myself using SKLearn and Python. Andrew proposes a method in which each email is encoded as follows:

We let x_i denote the identity of the i-th word in the email. Thus, x_i is now an integer taking values in {1, . . . , |V|}, where |V| is the size of our vocabulary (dictionary). An email of n words is now represented by a vector (x_1, x_2, . . . , x_n) of length n; note that n can vary for different documents. For instance, if an email starts with “A NIPS . . . ,” then x_1 = 1 (“a” is the first word in the dictionary), and x_2 = 35000 (if “nips” is the 35000th word in the dictionary).


So this is also what I did in Python. I have a vocabulary, which is a list of 502 words, and I encoded each "email" so that it's represented the same way as Andrew describes, for example the message "this is sparta" is represented by [495, 296, 359] and "this is not sparta" by [495, 296, 415, 359].
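For concreteness, this is roughly what my encoding step looks like (with a made-up five-word vocabulary standing in for my real 502-word list; `vocab` and `encode` are just illustrative names):

```python
# Hypothetical miniature vocabulary standing in for the real 502-word list.
vocab = ["a", "is", "not", "sparta", "this"]
word_index = {w: i + 1 for i, w in enumerate(vocab)}  # 1-based, as in Ng's notes

def encode(message):
    """Map each word of a message to its 1-based dictionary index."""
    return [word_index[w] for w in message.lower().split()]

print(encode("this is sparta"))      # three words -> a length-3 vector
print(encode("this is not sparta"))  # four words -> a length-4 vector
```

Note that the two encoded messages have different lengths, which is exactly where my problem comes from.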

So here comes the problem.

Apparently, SKLearn's MultinomialNB requires input of uniform shape. I'm not sure about this, but right now I'm getting ValueError: setting an array element with a sequence., which I think is because the input vectors are not all the same size.

So my question is, how can I use MultinomialNB for multiple length messages? Is it possible? What am I missing?

Here's some more of what I'm doing with code:

X = posts['wordsencoded'].values  # ragged: each element is a list of a different length
y = posts['highview'].values
clf = MultinomialNB()
clf.fit(X, y)  # the ValueError is raised here
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)  # REPL echo of the estimator
print(clf.predict())  # (predict() would also need an X argument)

What the input looks like: [screenshots of the encoded-message columns omitted]

Stack trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-933-dea987cd8603> in <module>()
      3 y = posts['highview'].values
      4 clf = MultinomialNB()
----> 5 clf.fit(X, y)
      6 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
      7 print(clf.predict())

/usr/local/lib/python3.4/dist-packages/sklearn/naive_bayes.py in fit(self, X, y, sample_weight)
    525             Returns self.
    526         """
--> 527         X, y = check_X_y(X, y, 'csr')
    528         _, n_features = X.shape
    529 

/usr/local/lib/python3.4/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    508     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    509                     ensure_2d, allow_nd, ensure_min_samples,
--> 510                     ensure_min_features, warn_on_dtype, estimator)
    511     if multi_output:
    512         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

/usr/local/lib/python3.4/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    371                                       force_all_finite)
    372     else:
--> 373         array = np.array(array, dtype=dtype, order=order, copy=copy)
    374 
    375         if ensure_2d:

ValueError: setting an array element with a sequence.

Solution

  • Yes, you are thinking along the right lines. You have to encode each mail as a fixed-length vector. This vector is called a word count vector, and in your case it has 502 dimensions, one per email in the training set.

    Each word count vector contains the frequency of each of the 502 dictionary words in that email. Of course, you might have guessed by now that most of the entries will be zero. For example, "this is not sparta not is this sparta" would be encoded like below. [0,0,0,0,0,.......0,0,2,0,0,0,......,0,0,2,0,0,...0,0,2,0,0,......2,0,0,0,0,0,0]

    Here, the four 2's are placed at the 296th, 359th, 415th, and 495th indices of the 502-length word count vector.

    So a feature-vector matrix is generated whose rows correspond to the files of the training set and whose columns correspond to the 502 dictionary words. The value at position (i, j) is the number of occurrences of the j-th dictionary word in the i-th file.

    This generated encoding of the emails (the feature-vector matrix) can be given to MultinomialNB for training.

    You will also have to generate the same 502-length encoding for each test email before predicting its class.

    You can easily build a spam-filter classifier with MultinomialNB on the Ling-Spam dataset using the following blog post, which uses sklearn and Python for the implementation.

    https://appliedmachinelearning.wordpress.com/2017/01/23/nlp-blog-post/
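As a sketch of the whole pipeline, sklearn's CountVectorizer can build the dictionary and the word-count matrix for you, so you don't have to assemble the count vectors by hand (the corpus and labels below are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training corpus and labels, standing in for the asker's data.
train_texts = ["this is sparta", "this is not sparta", "sparta is great"]
y = [1, 0, 1]

# CountVectorizer builds the dictionary and emits the fixed-width
# (n_documents x vocabulary_size) word-count matrix MultinomialNB expects.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)  # sparse matrix, uniform shape
print(X.shape)  # (3, 5): 3 documents, 5 dictionary words

clf = MultinomialNB()
clf.fit(X, y)

# Test emails must be encoded with the SAME fitted dictionary,
# so use transform(), not fit_transform(), on new data.
X_test = vectorizer.transform(["is this sparta"])
print(clf.predict(X_test))
```

Because every row of `X` has the same width (the vocabulary size), `fit` no longer raises the ValueError from the question.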