
How to get top n terms with highest tf-idf score - Big sparse matrix


There is this code:

feature_array = tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]

n = 3
top_n = feature_array[tfidf_sorting][:n]

coming from this answer.

My question is how can I efficiently do this in the case where my sparse matrix is too big to convert at once to a dense matrix (with response.toarray())?

Apparently, the general approach is to split the sparse matrix into chunks, convert each chunk to dense in a for loop, and then combine the results across all chunks.

But I would like to see the complete code that does this.


Solution

  • If you look closely at that question, it is about the top tf-idf scores for a single document.

    When you want to do the same thing for a large corpus, you need to sum the scores of each feature across all documents (note that this is not very meaningful, because the scores are l2-normalized per document in TfidfVectorizer(); read here). I would recommend using the .idf_ scores to find the features with a high inverse document frequency.

    If instead you want the top features by number of occurrences, use CountVectorizer().

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    corpus = [
        'I would like to check this document',
        'How about one more document',
        'Aim is to capture the key words from the corpus'
    ]
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(corpus)
    feature_array = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
    
    top_n = 3
    
    # sum the tf-idf score of each feature across all documents
    print('tf_idf scores: \n', sorted(zip(feature_array,
                                          np.asarray(X.sum(axis=0)).ravel()),
                                      key=lambda x: x[1], reverse=True)[:top_n])
    # tf_idf scores : 
    # [('document', 1.4736296010332683), ('check', 0.6227660078332259), ('like', 0.6227660078332259)]
    
    print('idf values: \n', sorted(zip(feature_array, vectorizer.idf_),
                                   key=lambda x: x[1], reverse=True)[:top_n])
    
    # idf values: 
    #  [('aim', 1.6931471805599454), ('capture', 1.6931471805599454), ('check', 1.6931471805599454)]
    
    vectorizer = CountVectorizer(stop_words='english')
    X = vectorizer.fit_transform(corpus)
    feature_array = vectorizer.get_feature_names_out()
    print('Frequency: \n', sorted(zip(feature_array,
                                      np.asarray(X.sum(axis=0)).ravel()),
                                  key=lambda x: x[1], reverse=True)[:top_n])
    
    # Frequency: 
    #  [('document', 2), ('aim', 1), ('capture', 1)]
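
  • To answer the original question directly, here is a minimal sketch of the chunked approach: densify the sparse matrix a few rows at a time, take the top-n features per row, and collect the results. The `chunk_size` value and variable names are illustrative assumptions, not from the original answer.

    ```python
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        'I would like to check this document',
        'How about one more document',
        'Aim is to capture the key words from the corpus',
    ]
    tfidf = TfidfVectorizer(stop_words='english')
    X = tfidf.fit_transform(corpus)          # sparse CSR matrix
    feature_array = tfidf.get_feature_names_out()

    n = 2            # top terms per document
    chunk_size = 2   # number of rows converted to dense at a time

    top_terms = []
    for start in range(0, X.shape[0], chunk_size):
        # densify only this slice of rows, never the whole matrix
        chunk = X[start:start + chunk_size].toarray()
        # argsort each row in descending order and keep the first n columns
        top_idx = np.argsort(chunk, axis=1)[:, ::-1][:, :n]
        top_terms.extend(feature_array[row] for row in top_idx)

    for doc_id, terms in enumerate(top_terms):
        print(doc_id, list(terms))
    ```

    For a real corpus you would pick `chunk_size` so that `chunk_size * X.shape[1]` floats fit comfortably in memory; the loop's peak memory is one dense chunk plus the accumulated top-n lists.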