I am trying to get the words with the 10 highest TF-IDF scores for each document.
I have a column in my dataframe that contains the preprocessed text (without punctuation, stop words, etc.) from my various documents. One row means one document in this example.
It has over 500 rows and I am curious about the most important words in each row.
So I ran the following code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(df['liststring'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()
df2 = pd.DataFrame(denselist, columns=feature_names)
This gives me a TF-IDF matrix with one row per document and one column per word.
My question is: how can I collect the 10 words that have the highest TF-IDF values in each row? It would be nice to add a column to my original dataframe (df) that contains the top 10 words for each row, and also to know which words are the most important overall.
A minimal reproducible example using the 20newsgroups dataset:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
X, y = fetch_20newsgroups(return_X_y=True)
tfidf = TfidfVectorizer()
# note: the dense array is large for the full dataset
X_tfidf = tfidf.fit_transform(X).toarray()
vocab = tfidf.vocabulary_
reverse_vocab = {v: k for k, v in vocab.items()}  # column index -> word
feature_names = tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
df_tfidf = pd.DataFrame(X_tfidf, columns=feature_names)
idx = X_tfidf.argsort(axis=1)  # per-row column indices, ascending by score
tfidf_max10 = idx[:, -10:]  # last 10 columns = 10 highest scores (highest last)
df_tfidf['top10'] = [[reverse_vocab.get(item) for item in row] for row in tfidf_max10]
df_tfidf['top10']
0 [this, was, funky, rac3, bricklin, tellme, umd...
1 [1qvfo9innc3s, upgrade, experiences, carson, k...
2 [heard, anybody, 160, display, willis, powerbo...
3 [joe, green, csd, iastate, jgreen, amber, p900...
4 [tom, n3p, c5owcb, expected, std, launch, jona...
...
11309 [millie, diagnosis, headache, factory, scan, j...
11310 [plus, jiggling, screen, bodin, blank, mac, wi...
11311 [weight, ended, vertical, socket, the, westes,...
11312 [central, steven, steve, collins, bolson, hcrl...
11313 [california, kjg, 2101240, willow, jh2sc281xpm...
Name: top10, Length: 11314, dtype: object
To get the 10 features with the highest TF-IDF score across all documents, use:
global_top10_idx = X_tfidf.max(axis=0).argsort()[-10:]
np.asarray(feature_names)[global_top10_idx]
Please ask if something is not clear.