OK, so I'm studying Andrew Ng's Machine Learning course. I'm currently reading this chapter and want to try the Multinomial Naive Bayes model (bottom of page 12) for myself using SKLearn and Python. Andrew proposes a method in which each email is encoded as follows:
We let x_i denote the identity of the i-th word in the email. Thus, x_i is now an integer taking values in {1, ..., |V|}, where |V| is the size of our vocabulary (dictionary). An email of n words is now represented by a vector (x_1, x_2, ..., x_n) of length n; note that n can vary for different documents. For instance, if an email starts with "A NIPS ...", then x_1 = 1 ("a" is the first word in the dictionary), and x_2 = 35000 (if "nips" is the 35000th word in the dictionary).
So this is what I did in Python. I have a vocabulary, which is a list of 502 words, and I encoded each "email" so that it's represented the same way Andrew describes; for example, the message "this is sparta" is represented by [495, 296, 359] and "this is not sparta" by [495, 296, 415, 359].
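For reference, the encoding above can be sketched like this. This is a minimal illustration with a toy vocabulary standing in for the real 502-word list, so the indices here are made up and won't match the ones in the question:

```python
# Toy vocabulary (hypothetical); the real one has 502 words.
vocabulary = ["a", "is", "not", "sparta", "this"]

# 1-based indices, matching the {1, ..., |V|} convention in Ng's notes.
word_index = {word: i + 1 for i, word in enumerate(vocabulary)}

def encode(message):
    """Map each word of a message to its 1-based vocabulary index."""
    return [word_index[w] for w in message.lower().split()]

print(encode("this is sparta"))      # [5, 2, 4]
print(encode("this is not sparta"))  # [5, 2, 3, 4]
```

Note that the two encodings have different lengths, which is exactly what causes the problem below.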
So here comes the problem.
Apparently, SKLearn's MultinomialNB requires input of uniform shape. I'm not sure about this, but right now I'm getting ValueError: setting an array element with a sequence., which I think is because the input vectors are not all the same size.
So my question is: how can I use MultinomialNB with variable-length messages? Is it possible? What am I missing?
Here's some more of what I'm doing with code:
X = posts['wordsencoded'].values
y = posts['highview'].values
clf = MultinomialNB()
clf.fit(X, y)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
print(clf.predict())
Stack trace:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-933-dea987cd8603> in <module>()
3 y = posts['highview'].values
4 clf = MultinomialNB()
----> 5 clf.fit(X, y)
6 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
7 print(clf.predict())
/usr/local/lib/python3.4/dist-packages/sklearn/naive_bayes.py in fit(self, X, y, sample_weight)
525 Returns self.
526 """
--> 527 X, y = check_X_y(X, y, 'csr')
528 _, n_features = X.shape
529
/usr/local/lib/python3.4/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
508 X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
509 ensure_2d, allow_nd, ensure_min_samples,
--> 510 ensure_min_features, warn_on_dtype, estimator)
511 if multi_output:
512 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
/usr/local/lib/python3.4/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
371 force_all_finite)
372 else:
--> 373 array = np.array(array, dtype=dtype, order=order, copy=copy)
374
375 if ensure_2d:
ValueError: setting an array element with a sequence.
Yes, you are thinking along the right lines. You have to encode each mail as a fixed-length vector, called a word count vector: one vector of 502 dimensions (in your case) per email of the training set.
Each word count vector contains the frequency of each of the 502 dictionary words in that email. As you might have guessed, most of the entries will be zero. For example, "this is not sparta not is this sparta" will be encoded like this: [0,0,0,0,0,.......0,0,2,0,0,0,......,0,0,2,0,0,...0,0,2,0,0,......2,0,0,0,0,0,0]
Here, the four 2's are placed at the 296th, 359th, 415th and 495th indices of the 502-length word count vector.
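One way to build such a count vector from your existing index lists is np.bincount; a minimal sketch, assuming the 1-based indexing from the question (the helper name to_count_vector is just for illustration):

```python
import numpy as np

V = 502  # vocabulary size

def to_count_vector(encoded, vocab_size=V):
    """Turn a list of 1-based word indices into a fixed-length count vector."""
    # Shift to 0-based indices; minlength pads the vector with trailing zeros.
    return np.bincount(np.asarray(encoded) - 1, minlength=vocab_size)

# "this is not sparta not is this sparta", encoded as in the question:
counts = to_count_vector([495, 296, 415, 359, 415, 296, 495, 359])
print(counts.shape)                   # (502,)
print(counts[[295, 358, 414, 494]])   # [2 2 2 2]  (0-based positions of the four 2's)
```

Stacking these vectors row by row gives a 2-D array with one row per email, which is the shape MultinomialNB expects.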
So a feature vector matrix is generated, whose rows correspond to the emails of the training set and whose columns correspond to the 502 dictionary words. The value at index (i, j) is the number of occurrences of the j-th dictionary word in the i-th email.
This generated encoding of the emails (the feature vector matrix) can be given to MultinomialNB for training.
You will also have to generate the same 502-length encoding for each test email before predicting its class.
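Putting it together, fitting then works without the shape error, because every row now has the same width. This is a sketch with random stand-in counts instead of real emails (the data here is fabricated purely to show the shapes):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
V = 502  # vocabulary size

# Stand-in data: 20 "emails", each a fixed-width vector of word counts,
# and binary labels (e.g. your 'highview' column).
X_train = rng.integers(0, 3, size=(20, V))
y_train = rng.integers(0, 2, size=20)

clf = MultinomialNB()
clf.fit(X_train, y_train)  # uniform 2-D input: no ValueError

# Test emails must be encoded with the same 502-length scheme.
X_test = rng.integers(0, 3, size=(3, V))
print(clf.predict(X_test))  # one predicted class label per test row
```

In practice you can also skip the manual encoding entirely and let sklearn's CountVectorizer build the count matrix from the raw text for you.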
You can easily build a spam filter classifier with MultinomialNB on the Ling-Spam dataset by following the blog post below, which implements it with sklearn and Python.
https://appliedmachinelearning.wordpress.com/2017/01/23/nlp-blog-post/