python, scikit-learn, text-mining

How to obtain TF using only TfidfVectorizer?


I have a code like this one:

    corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'This document is the fourth document.',
        'And this is the fifth one.',
        'This document is the sixth.',
        'And this is the seventh one document.',
        'This document is the eighth.',
        'And this is the nineth one document.',
        'This document is the second.',
        'And this is the tenth one document.',
    ]

    vectorizer = skln.TfidfVectorizer() 
    X = vectorizer.fit_transform(corpus)
    tfidf_matrix = X.toarray()
    accumulated = [0] * len(vectorizer.get_feature_names())

    for i in range(tfidf_matrix.shape[0]):
        for j in range(len(vectorizer.get_feature_names())):
            accumulated[j] += tfidf_matrix[i][j]

    accumulated = sorted(accumulated)[-CENTRAL_TERMS:]
    print(accumulated)

where I print the CENTRAL_TERMS words which get the highest tf-idf scores over all the documents of the corpus.

However, I also want to get the MOST_REPEATED_TERMS words over all the documents of the corpus, that is, the words with the highest tf scores. I know I can obtain them simply by using CountVectorizer, but I want to use only TfidfVectorizer, so that I do not have to run vectorizer.fit_transform(corpus) once for the TfidfVectorizer and then again for the CountVectorizer. I also know that I could first use CountVectorizer (to obtain tf scores) followed by TfidfTransformer (to obtain tf-idf scores). However, I think there must be a way to do this using only TfidfVectorizer.

Let me know if there is a way to do this (any information is welcome).


Solution

  • By default, TfidfVectorizer applies l2 normalization after multiplying tf and idf, so the raw term frequencies cannot be recovered directly while norm='l2'.

    If you can work without normalization, there is a solution: set norm=None and divide each column by its idf.

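    import numpy as np
    import pandas as pd
    import scipy.sparse as sp
    from sklearn.feature_extraction.text import TfidfVectorizer

    # norm=None keeps the raw tf * idf products
    vectorizer = TfidfVectorizer(norm=None)
    X = vectorizer.fit_transform(corpus)
    features = vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
    n = len(features)
    # diagonal matrix holding 1/idf for each term
    inverse_idf = sp.diags(1 / vectorizer.idf_,
                           offsets=0,
                           shape=(n, n),
                           format='csr',
                           dtype=np.float64).toarray()

    # multiplying by 1/idf undoes the idf weighting and leaves the term counts
    pd.DataFrame(X @ inverse_idf,
                 columns=features)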
    
