Tags: python, pandas, scikit-learn, tf-idf, tfidfvectorizer

Which 10 words have the highest TF-IDF value in each document / in total?


I am trying to get the words with the 10 highest TF-IDF scores for each document.

I have a column in my dataframe that contains the preprocessed text (without punctuation, stop words, etc.) from my various documents. One row means one document in this example.

[Image: my dataframe]

It has over 500 rows and I am curious about the most important words in each row.

So I ran the following code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(df['liststring'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
# One row per document, one column per vocabulary word
df2 = pd.DataFrame(vectors.toarray(), columns=feature_names)

Which gives me a TF-IDF matrix:

[Image: TF-IDF matrix]

My question is: how can I collect the 10 words with the highest TF-IDF values? Ideally I would add a column to my original dataframe (df) that contains the top 10 words for each row, and I would also like to know which words are the most important across all documents.


Solution

  • A minimal reproducible example using the 20 newsgroups dataset:

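    import pandas as pd
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer

    X, y = fetch_20newsgroups(return_X_y=True)
    tfidf = TfidfVectorizer()
    # Note: the dense array is large for this corpus and can need many GB of RAM
    X_tfidf = tfidf.fit_transform(X).toarray()
    vocab = tfidf.vocabulary_                          # word -> column index
    reverse_vocab = {v: k for k, v in vocab.items()}   # column index -> word

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = tfidf.get_feature_names_out()
    df_tfidf = pd.DataFrame(X_tfidf, columns=feature_names)

    # argsort is ascending, so the last 10 indices in each row are the 10 highest scores
    idx = X_tfidf.argsort(axis=1)
    tfidf_max10 = idx[:, -10:]

    # Map the column indices back to words (lowest of the top 10 first, highest last)
    df_tfidf['top10'] = [[reverse_vocab.get(item) for item in row] for row in tfidf_max10]

    df_tfidf['top10']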
    
    0        [this, was, funky, rac3, bricklin, tellme, umd...
    1        [1qvfo9innc3s, upgrade, experiences, carson, k...
    2        [heard, anybody, 160, display, willis, powerbo...
    3        [joe, green, csd, iastate, jgreen, amber, p900...
    4        [tom, n3p, c5owcb, expected, std, launch, jona...
                                   ...                        
    11309    [millie, diagnosis, headache, factory, scan, j...
    11310    [plus, jiggling, screen, bodin, blank, mac, wi...
    11311    [weight, ended, vertical, socket, the, westes,...
    11312    [central, steven, steve, collins, bolson, hcrl...
    11313    [california, kjg, 2101240, willow, jh2sc281xpm...
    Name: top10, Length: 11314, dtype: object
    

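    To get the 10 features with the highest TF-IDF score across all documents, use:

    import numpy as np

    # Highest score each word reaches in any document, then the 10 largest of those
    global_top10_idx = X_tfidf.max(axis=0).argsort()[-10:]
    np.asarray(feature_names)[global_top10_idx]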
    

    Please ask if something is not clear.