Tags: python, python-3.x, scikit-learn, countvectorizer

How to get column sum in the matrix returned by sklearn count vectorizer?


How to get the sum of any given column in the term frequency matrix returned by sklearn CountVectorizer?

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

corpus = [ 'This is a sentence',
           'Another sentence is here',
           'Wait for another sentence',
           'The sentence is coming',
           'The sentence has come'
         ]

x = vectorizer.fit_transform(corpus)

For example, I want to find the frequency of the term 'sentence' in the matrix, i.e. the sum of the 'sentence' column. I couldn't figure out a way to do this:

  • I tried x['sentence'].sum(), but that didn't work.
  • I also tried converting the matrix to a pandas DataFrame and computing the sum there, but I shouldn't need to convert the matrix to a DataFrame just for this.

Solution

  • You can try the following:

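    1. Look up the column index of your term in the fitted CountVectorizer. Its vocabulary_ attribute maps each term to a column index; in older scikit-learn versions the position of the term in the get_feature_names() list is the same index.
    2. Use that index to sum the corresponding column of the CSR matrix (x, in your case).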

    Code:

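    import numpy as np

    term_to_sum = 'sentence'
    # vocabulary_ maps each term to its column index in x.
    # (get_feature_names() was removed in scikit-learn 1.2; on older
    # versions, get_feature_names().index(term_to_sum) gives the same index.)
    index_term = vectorizer.vocabulary_[term_to_sum]

    s = x[:, index_term].sum()  # sum of the 'sentence' column (5 for this corpus)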
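
If you need the totals for several terms, one possible variation is to sum every column in a single pass and then look terms up by name. This is only a sketch built on the corpus from the question; the names column_sums and term_counts are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is a sentence',
    'Another sentence is here',
    'Wait for another sentence',
    'The sentence is coming',
    'The sentence has come',
]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus)

# Summing over axis=0 collapses the rows and leaves one total per column.
# The result is 2-D (1 x n_features), so flatten it to a 1-D array.
column_sums = np.asarray(x.sum(axis=0)).ravel()

# vocabulary_ maps each term to its column index, so it can be used
# to pair every term with its total count.
term_counts = {
    term: int(column_sums[idx])
    for term, idx in vectorizer.vocabulary_.items()
}

print(term_counts['sentence'])
```

Note that with the default settings CountVectorizer lowercases the text and drops single-character tokens, so 'a' has no column and 'The' is counted under 'the'.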