python-3.x scikit-learn nlp text-mining countvectorizer

Access document-term matrix without calling .fit_transform() each time

If I've already called vectorizer.fit_transform(corpus), is the only way to later print the document-term matrix to call vectorizer.fit_transform(corpus) again?

from sklearn.feature_extraction.text import CountVectorizer
corpus = ['the', 'quick','brown','fox']
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit_transform(corpus) # Returns the document-term matrix

My understanding is that by doing the above, I've saved the terms in the vectorizer object. I assume this because I can now call vectorizer.vocabulary_ without passing in corpus again.

So I wondered why there is not a method like .document_term_matrix?

It seems weird that I have to pass in the corpus again if the data is already stored in the vectorizer object. But per the docs, only .transform and .fit_transform return the matrix (.fit returns the fitted vectorizer itself).

Docs: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit

Other Info:

I'm using Anaconda and Jupyter Notebook.

Solution

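The vectorizer only stores the learned vocabulary, not the document-term matrix, which is why there is no attribute for it. Simply assign the result of fit_transform to a variable dtm, and, since it is a SciPy sparse matrix, use the toarray method to print it. If you need the matrix again later, vectorizer.transform(corpus) rebuilds it from the stored vocabulary without refitting.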

from sklearn.feature_extraction.text import CountVectorizer
corpus = ['the', 'quick','brown','fox']
vectorizer = CountVectorizer(stop_words='english')
dtm = vectorizer.fit_transform(corpus)

# vectorizer object is still fit:
vectorizer.vocabulary_
# {'brown': 0, 'fox': 1, 'quick': 2}

dtm.toarray()
# array([[0, 0, 0],
#        [0, 0, 1],
#        [1, 0, 0],
#        [0, 1, 0]], dtype=int64)

For any realistic document-term matrix, though, converting to a dense array with toarray will be impractical. You could use the nonzero method instead, which returns the row and column indices of the nonzero entries:

dtm.nonzero()
# (array([1, 2, 3], dtype=int32), array([2, 0, 1], dtype=int32))
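
To make the nonzero output easier to read, you can map the column indices back to terms using vocabulary_. The following is only a minimal sketch on the same toy corpus; the names terms, entries, dtm_again and same are my own, not part of scikit-learn. It also checks that transform on the already fitted vectorizer reproduces the matrix without refitting:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the', 'quick', 'brown', 'fox']
vectorizer = CountVectorizer(stop_words='english')
dtm = vectorizer.fit_transform(corpus)  # fit once and keep the matrix

# Column index -> term, rebuilt from the fitted vocabulary_ mapping
# (terms is an illustrative helper name).
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Pair each nonzero entry with (document index, term, count).
rows, cols = dtm.nonzero()
entries = [(int(r), terms[c], int(dtm[r, c])) for r, c in zip(rows, cols)]
print(entries)
# [(1, 'quick', 1), (2, 'brown', 1), (3, 'fox', 1)]

# transform() reuses the stored vocabulary, so no refit is needed;
# the two sparse matrices differ in zero positions.
dtm_again = vectorizer.transform(corpus)
same = (dtm != dtm_again).nnz == 0
print(same)
# True
```

This stays sparse throughout, so it should scale better than toarray, although looping over every nonzero entry of a large matrix in pure Python will still be slow.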