python sparse-matrix text-classification countvectorizer

The names of the columns in a CountVectorizer sparse matrix in Python


When I use the code below:

from sklearn.feature_extraction.text import CountVectorizer
X = dataset.Tweet
y = dataset.Type

count_vect = CountVectorizer()
BoW = count_vect.fit_transform(X)

It returns the document-term matrix (term counts per document) as a sparse matrix.

I found out how to get the data, indices, and indptr of the sparse matrix.

My problem is: how can I get the names of the columns (which should be the features, i.e. the words)?


Solution

  • What you want to use is vectorizer.get_feature_names_out() (called get_feature_names() before scikit-learn 1.0; the old name was removed in 1.2). Here is an example from the docs:

    from sklearn.feature_extraction.text import CountVectorizer
    corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?',
    ]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)
    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    print(vectorizer.get_feature_names_out())
    # ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
    print(X.toarray())
    # [[0 1 1 1 0 0 1 0 1]
    #  [0 2 0 1 0 1 1 0 1]
    #  [1 0 0 1 1 0 1 1 1]
    #  [0 1 1 1 0 0 1 0 1]]
    

    Docs link: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html