python-3.x, scikit-learn, nlp, text-mining, countvectorizer

Access document-term matrix without calling .fit_transform() each time


If I've already called vectorizer.fit_transform(corpus), is the only way to later print the document-term matrix to call vectorizer.fit_transform(corpus) again?

from sklearn.feature_extraction.text import CountVectorizer
corpus = ['the', 'quick','brown','fox']
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit_transform(corpus) # Returns the document-term matrix

My understanding is that by doing the above, I've saved the terms in the vectorizer object. I assume this because I can now call vectorizer.vocabulary_ without passing in corpus again.

So I wondered why there isn't an attribute like .document_term_matrix.

It seems weird that I have to pass in the corpus again if the data is already stored in the vectorizer object. But per the docs, only .transform and .fit_transform return the matrix (.fit returns the fitted vectorizer itself).

Docs: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit

Other Info:

I'm using Anaconda and Jupyter Notebook.


Solution

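  • The fitted vectorizer stores only the learned vocabulary, not the matrix itself, so there is no attribute to read the matrix back from. You can simply assign the result of fit_transform to a variable, dtm, and, since it is a SciPy sparse matrix, use its toarray method to print it: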

    from sklearn.feature_extraction.text import CountVectorizer
    corpus = ['the', 'quick','brown','fox']
    vectorizer = CountVectorizer(stop_words='english')
    dtm = vectorizer.fit_transform(corpus)
    
    # vectorizer object is still fit:
    vectorizer.vocabulary_
    # {'brown': 0, 'fox': 1, 'quick': 2}
    
    dtm.toarray()
    # array([[0, 0, 0],
    #        [0, 0, 1],
    #        [1, 0, 0],
    #        [0, 1, 0]], dtype=int64)
    

    For any realistic document-term matrix, though, converting to a dense array will be impractical. You could use the nonzero method instead, which returns the row and column indices of the nonzero entries:

    dtm.nonzero()
    # (array([1, 2, 3], dtype=int32), array([2, 0, 1], dtype=int32))
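If the dtm variable was not kept, calling transform on the already fitted vectorizer should rebuild the same matrix without refitting. The sketch below illustrates that and also maps the nonzero indices back to their terms. The names dtm_again, terms, and pairs are introduced here for illustration only, and the dense comparison is only sensible for a toy corpus like this one.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the', 'quick', 'brown', 'fox']
vectorizer = CountVectorizer(stop_words='english')

# Fit once and keep the returned sparse matrix for later use.
dtm = vectorizer.fit_transform(corpus)

# transform() reuses the learned vocabulary instead of refitting,
# so it reproduces the matrix for the same corpus.
dtm_again = vectorizer.transform(corpus)
same = np.array_equal(dtm.toarray(), dtm_again.toarray())

# Order the terms by their column index in the matrix.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Turn each nonzero entry into a (document index, term, count) triple.
rows, cols = dtm.nonzero()
pairs = [(int(r), terms[c], int(dtm[r, c])) for r, c in zip(rows, cols)]

print(same)
# True
print(pairs)
# [(1, 'quick', 1), (2, 'brown', 1), (3, 'fox', 1)]
```

Document 0 ('the') does not appear in pairs because its only word is an English stop word, which matches the all-zero first row shown by toarray above.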