Tags: python, scikit-learn, scipy, countvectorizer

Term relative frequency matrix from CountVectorizer


Is there a way to obtain the relative frequency matrix from the absolute frequency matrix produced by CountVectorizer? This is the code I am using:

body = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox'
]

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
bag_of_words = vectorizer.fit_transform(body)

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2)
lsa = svd.fit_transform(bag_of_words) 

My goal is to call fit_transform() (on the last line of my code) on the relative frequency matrix rather than on the absolute frequency matrix. In particular, I would like to divide each row of bag_of_words by the sum of that row. It is not obvious to me how to do this because the matrix is sparse.

Any advice or suggestion is appreciated. Thank you.


Solution

  • This can be done using TfidfVectorizer instead of CountVectorizer, provided you change two of its default parameters:

    • use_idf=False removes the "idf" part of the TF-IDF weighting, leaving only the term frequency
    • norm="l1" replaces the default L2 normalization; because counts are non-negative, dividing each row by the sum of its counts is exactly L1 normalization

    In practice, it would look like this:

    from sklearn.feature_extraction.text import TfidfVectorizer
    body = [
        'the quick brown fox',
        'the slow brown dog',
        'the quick red dog',
        'the lazy yellow fox'
    ]
    vectorizer = TfidfVectorizer(use_idf=False, norm="l1")
    X = vectorizer.fit_transform(body)
    # X is sparse; convert it to a dense array only for display
    print(X.toarray())
    # get_feature_names() was removed in scikit-learn 1.2
    print(vectorizer.get_feature_names_out().tolist())

    This will print:

    [[0.25 0.   0.25 0.   0.25 0.   0.   0.25 0.  ]
     [0.25 0.25 0.   0.   0.   0.   0.25 0.25 0.  ]
     [0.   0.25 0.   0.   0.25 0.25 0.   0.25 0.  ]
     [0.   0.   0.25 0.25 0.   0.   0.   0.25 0.25]]
    ['brown', 'dog', 'fox', 'lazy', 'quick', 'red', 'slow', 'the', 'yellow']
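
If you would rather keep the CountVectorizer from the question, one possible alternative (not part of the answer above) is to L1-normalize the rows of the sparse count matrix directly with sklearn.preprocessing.normalize. The sketch below assumes the same corpus and the stop_words='english' setting from the question; the names relative_freq and row_sums are only illustrative. The normalized matrix stays sparse and can be passed straight to TruncatedSVD.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

body = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox',
]

# Absolute counts, with English stop words removed as in the question.
vectorizer = CountVectorizer(stop_words='english')
bag_of_words = vectorizer.fit_transform(body)

# Counts are non-negative, so the L1 norm of a row equals its sum.
# normalize() divides each row by that value and returns a sparse matrix.
relative_freq = normalize(bag_of_words, norm='l1', axis=1)

# Sanity check: every row of relative frequencies should sum to 1.
row_sums = np.asarray(relative_freq.sum(axis=1)).ravel()
print(np.allclose(row_sums, 1.0))

# Feed the relative frequencies (not the raw counts) to the SVD.
svd = TruncatedSVD(n_components=2)
lsa = svd.fit_transform(relative_freq)
print(lsa.shape)
```

Since 'the' is filtered out as a stop word here, each document keeps three terms, so the values differ from the 0.25 entries shown above. TfidfVectorizer accepts the same stop_words argument if you prefer the single-step approach.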