Tags: python, python-2.7, text, scikit-learn, sklearn-pandas

CountVectorizer: transform method returns multidimensional array on a single text line


First, I fit it on a corpus of SMS messages:

from sklearn.feature_extraction.text import CountVectorizer
clf = CountVectorizer()
X_desc = clf.fit_transform(X).toarray()

It seems to work fine:

X.shape = (5574,)
X_desc.shape = (5574, 8713)

But when I apply the transform method to a single line of text, I expect the result to have shape (1, 8713). Instead I get:

str2 = 'Have you visited the last lecture on physics?'
print len(str2), clf.transform(str2).toarray().shape

52 (52, 8713)

What is going on here? Also, all of the values in the result are zeros.


Solution

  • transform always expects an iterable of documents (for example a list or array), not a bare string. To transform a single document, wrap it in a one-element list and then take the first row of the result:

    clf.transform([str2])[0]
    

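    Incidentally, the reason you are getting one row per character is that a string is itself a sequence of characters, so the vectoriser iterates over your string as if it were an array of documents and treats each character as a separate document. That also explains the zeros: with the default token_pattern, CountVectorizer only counts tokens of two or more alphanumeric characters, so a one-character document contains no tokens at all.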
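To make the difference concrete, here is a minimal sketch. It assumes a tiny made-up corpus (the original SMS data is not shown), and the names `corpus`, `vectorizer`, `single`, and `per_char` are illustrative only. It uses Python 3 `print` syntax rather than the Python 2.7 syntax from the question. Because newer scikit-learn releases, as far as I know, reject a bare string with a ValueError instead of iterating over it, the sketch reproduces the per-character behaviour explicitly with `list(text)`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in corpus (hypothetical; not the SMS data from the question).
corpus = [
    "Have you visited the lecture?",
    "The last lecture on physics was long",
    "Call me after the lecture",
]

vectorizer = CountVectorizer()
vectorizer.fit(corpus)
n_features = len(vectorizer.vocabulary_)

text = "Have you visited the last lecture on physics?"

# Correct usage: wrap the single document in a list, which yields one row.
single = vectorizer.transform([text]).toarray()
print(single.shape)    # one row, n_features columns

# What iterating over a bare string amounts to: every character
# becomes its own one-character "document".
per_char = vectorizer.transform(list(text)).toarray()
print(per_char.shape)  # len(text) rows, n_features columns
print(per_char.sum())  # no one-character document produces a token
```

The second call has as many rows as the string has characters, and every entry is zero because the default tokenizer ignores one-character tokens. This matches the symptoms in the question.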