First, I fit a CountVectorizer on a corpus of SMS messages:
from sklearn.feature_extraction.text import CountVectorizer
clf = CountVectorizer()
X_desc = clf.fit_transform(X).toarray()
This seems to work fine:
X.shape = (5574,)
X_desc.shape = (5574, 8713)
But then I applied the transform method to a single line of text. I expected the result to have shape (1, 8713), but this is what I get:
str2 = 'Have you visited the last lecture on physics?'
print len(str2), clf.transform(str2).toarray().shape
52 (52, 8713)
What is going on here? Also, all the values in the result are zeros.
You always need to pass an array or list of documents to transform; if you just want to transform a single element, you need to pass a singleton list and then extract its contents:
clf.transform([str2])[0]
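Incidentally, the reason you are getting a 2-dimensional array with one row per character is that a string is itself a sequence of characters, so the vectoriser iterates over it and treats each character as a separate document. That is also why every value is zero: with the default token_pattern, CountVectorizer only counts tokens of two or more alphanumeric characters, so a single-character document contains no tokens.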
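To make the difference concrete, here is a minimal sketch. The three-message corpus is made up for illustration and stands in for the SMS data, and the names vectorizer, single and per_char are mine, not from the question. It passes the characters explicitly with list(text) to mimic what happens when a bare string is iterated; as far as I know, recent scikit-learn releases reject a bare string with a ValueError instead of iterating over it silently.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the SMS messages.
corpus = [
    "Have you visited the last lecture",
    "The lecture on physics was cancelled",
    "Call me after the lecture",
]

vectorizer = CountVectorizer()
vectorizer.fit(corpus)
n_features = len(vectorizer.vocabulary_)

text = "Have you visited the last lecture on physics?"

# A one-element list is one document, so the result has one row.
single = vectorizer.transform([text])
print(single.shape)      # (1, n_features)

# A list of characters is one document per character, so the result
# has len(text) rows, mirroring the behaviour seen in the question.
per_char = vectorizer.transform(list(text))
print(per_char.shape)    # (len(text), n_features)

# Single characters never match the default token pattern,
# so no counts are recorded at all.
print(per_char.sum())
```

If you want a plain 1-D vector for the single document, single.toarray()[0] gives that; indexing the sparse result with [0] still returns a 1-row sparse matrix.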