python, scikit-learn, ranking, tf-idf

Sorting TfidfVectorizer output by tf-idf (lowest to highest and vice versa)


I'm using TfidfVectorizer() from sklearn on part of my text data to get a sense of the term frequency of each feature (word). My current code is the following:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='word', stop_words='english')

# fit_transform on training data
X_traintfidf = tfidf.fit_transform(X_train)

I want to sort the terms in 'X_traintfidf' by their tf-idf values, both from lowest to highest and from highest to lowest, keep, say, the top 10 of each, and store the two rankings as two Series objects. How should I proceed from the last line of my code?

Thank you.

I was reading a similar thread but couldn't figure out how to do it. Maybe someone will be able to connect the tips shown in that thread to my question here.


Solution

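  • After fit_transform(), the fitted vocabulary is available through the get_feature_names_out() method (called get_feature_names() before scikit-learn 1.0; the old name was removed in 1.2). Summing each column of the tf-idf matrix gives one score per term, which you can then sort:

    import pandas as pd

    terms = tfidf.get_feature_names_out()

    # sum the tf-idf weights of each term across all documents
    sums = X_traintfidf.sum(axis=0)

    # pair each term with its summed tf-idf weight
    data = []
    for col, term in enumerate(terms):
        data.append((term, sums[0, col]))

    ranking = pd.DataFrame(data, columns=['term', 'rank'])
    # ascending=False lists the highest scores first; use ascending=True for the lowest first
    print(ranking.sort_values('rank', ascending=False))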