If I've already called vectorizer.fit_transform(corpus), is the only way to later print the document-term matrix to call vectorizer.fit_transform(corpus) again?
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['the', 'quick', 'brown', 'fox']
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit_transform(corpus) # Returns the document-term matrix
My understanding is that by doing the above, I've saved the terms into the vectorizer object. I assume this because I can now call vectorizer.vocabulary_ without passing in corpus again.
So I wondered why there is no method like .document_term_matrix?
It seems weird that I have to pass in the corpus again if the data is already stored in the vectorizer object. But per the docs, only .fit, .transform, and .fit_transform return the matrix.
Other Info:
I'm using Anaconda and Jupyter Notebook.
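The vectorizer keeps only the learned vocabulary, not the matrix itself, so simply assign the result of fit_transform to a variable such as dtm (calling vectorizer.transform(corpus) later would also rebuild it without refitting). Since it is a SciPy sparse matrix, use its toarray method to print it: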
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['the', 'quick', 'brown', 'fox']
vectorizer = CountVectorizer(stop_words='english')
dtm = vectorizer.fit_transform(corpus)
# vectorizer object is still fit:
vectorizer.vocabulary_
# {'brown': 0, 'fox': 1, 'quick': 2}
dtm.toarray()
# array([[0, 0, 0],
#        [0, 0, 1],
#        [1, 0, 0],
#        [0, 1, 0]], dtype=int64)
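If it helps to see which column belongs to which term, here is a minimal sketch that prints the dense matrix with a header row. It assumes the same corpus as above; the name terms is only illustrative, and the column order is recovered by sorting vocabulary_ by its stored column index:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the', 'quick', 'brown', 'fox']
vectorizer = CountVectorizer(stop_words='english')
dtm = vectorizer.fit_transform(corpus)

# vocabulary_ maps term -> column index, so sorting the terms by that
# index gives them in the same order as the matrix columns.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Header row first, then one row of counts per document.
print(' ' * 8 + ' '.join(f'{t:>6}' for t in terms))
for doc, row in zip(corpus, dtm.toarray()):
    print(f'{doc:>8}' + ' '.join(f'{int(c):>6}' for c in row))
```

The row for 'the' is all zeros because it is removed as an English stop word.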
For any realistic document-term matrix, though, printing the dense array will be impractical. You could use the nonzero method instead:
dtm.nonzero()
# (array([1, 2, 3], dtype=int32), array([2, 0, 1], dtype=int32))
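The two index arrays are easier to read once they are paired with the terms and counts. The following is only a sketch under the same assumptions as above: index_to_term and entries are names made up for illustration, and it goes through SciPy's COO format, which exposes the row indices, column indices, and values as parallel arrays:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the', 'quick', 'brown', 'fox']
vectorizer = CountVectorizer(stop_words='english')
dtm = vectorizer.fit_transform(corpus)

# Invert the term -> column mapping so a column index can be turned
# back into its term.
index_to_term = {int(col): term for term, col in vectorizer.vocabulary_.items()}

# Collect (document index, term, count) for every stored entry.
coo = dtm.tocoo()
entries = sorted(
    (int(r), index_to_term[int(c)], int(v))
    for r, c, v in zip(coo.row, coo.col, coo.data)
)

for doc_index, term, count in entries:
    print(doc_index, term, count)
```

This never builds the dense array, so it should stay usable on a large corpus where toarray would not.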