Before I post this question, I should say that I've thoroughly read more than 15 similar threads on this board, each with somewhat different recommendations, but none of them solved my problem.
OK, so I split my 'spam email' text data (originally in CSV format) into training and test sets, then used CountVectorizer and its fit_transform function to fit the vocabulary of the corpus and extract word-count features from the text. I then applied MultinomialNB() to learn from the training set and predict on the test set. Here is my code (simplified):
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
# loading data
# data contains two columns ('text', 'target')
spam_data = pd.read_csv('spam.csv')
spam_data['target'] = np.where(spam_data['target'] == 'spam', 1, 0)
# split data
X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], spam_data['target'], random_state=0)
# fit vocabulary and extract word count features
cv = CountVectorizer()
X_traincv = cv.fit_transform(X_train)
X_testcv = cv.fit_transform(X_test)
# learn and predict using MultinomialNB
clfNB = MultinomialNB(alpha=0.1)
clfNB.fit(X_traincv, y_train)
# so far so good, but when I predict on X_testcv
y_pred = clfNB.predict(X_testcv)
# Python raises an error: dimension mismatch
The suggestions I gleaned from previous threads are to (1) use only .transform() on X_test, (2) check that each row in the original spam data is in string format (yes, they are), or (3) do nothing to X_test. None of them worked, and Python kept giving me a 'dimension mismatch' error. After struggling for 4 hours, I had to turn to Stack Overflow. I would truly appreciate it if anyone could enlighten me. I just want to know what is wrong with my code and how to get the dimensions right.
Thank you.
Btw, the original data entries look like this:
   text                                            target
0  Go until jurong point, crazy.. Available only   0
1  Ok lar... Joking wif u oni...                   0
2  Free entry in 2 a wkly comp to win FA Cup fina  1
3  U dun say so early hor... U c already then say  0
4  Nah I don't think he goes to usf, he lives aro  0
5  FreeMsg Hey there darling it's been 3 week's n  1
6  WINNER!! As a valued network customer you have  1
Your CountVectorizer has already been fitted with the training data. So for your test data, you just want to call transform(), not fit_transform().
Otherwise, if you use fit_transform() again on your test data, you get different columns based on the unique vocabulary of the test data. So just fit once for training.
X_testcv = cv.transform(X_test)
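To see the difference in column counts for yourself, here is a minimal sketch. The toy sentences and labels are made up for illustration (they are not from your spam.csv), and X_wrong is just a hypothetical name for the matrix you get by refitting on the test data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data invented for this example (assumption: stands in for spam.csv)
train_texts = [
    "free entry to win a prize",
    "ok see you later",
    "win cash now",
    "are you coming home",
]
train_labels = [1, 0, 1, 0]
test_texts = ["free cash prize", "see you at home tonight"]

cv = CountVectorizer()
# Learn the vocabulary from the training texts only
X_traincv = cv.fit_transform(train_texts)
# Reuse that vocabulary for the test texts; words unseen in training are ignored
X_testcv = cv.transform(test_texts)

# For comparison: refitting on the test texts builds a different vocabulary,
# so the number of columns no longer matches the training matrix
X_wrong = CountVectorizer().fit_transform(test_texts)

clfNB = MultinomialNB(alpha=0.1)
clfNB.fit(X_traincv, train_labels)
# Works because X_testcv has the same number of columns as X_traincv
y_pred = clfNB.predict(X_testcv)

print("train:", X_traincv.shape)
print("test (transform):", X_testcv.shape)
print("test (fit_transform):", X_wrong.shape)
```

The second number in the first two shapes should be identical, while the third will generally differ, and that is the mismatch the classifier complains about.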