I have code like this:
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'This document is the fourth document.',
'And this is the fifth one.',
'This document is the sixth.',
'And this is the seventh one document.',
'This document is the eighth.',
'And this is the nineth one document.',
'This document is the second.',
'And this is the tenth one document.',
]
vectorizer = skln.TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
tfidf_matrix = X.toarray()
accumulated = [0] * len(vectorizer.get_feature_names())
for i in range(tfidf_matrix.shape[0]):
    for j in range(len(vectorizer.get_feature_names())):
        accumulated[j] += tfidf_matrix[i][j]
accumulated = sorted(accumulated)[-CENTRAL_TERMS:]
print(accumulated)
where I print the CENTRAL_TERMS words that get the highest tf-idf scores over all the documents of the corpus.
However, I also want to get the MOST_REPEATED_TERMS words over all the documents of the corpus. These are the words with the highest tf scores. I know I can obtain them by simply using CountVectorizer, but I want to use only TfidfVectorizer (to avoid running vectorizer.fit_transform(corpus) once for the TfidfVectorizer and then again for the CountVectorizer). I also know that I could first use CountVectorizer (to obtain tf scores) followed by TfidfTransformer (to obtain tf-idf scores). However, I think there must be a way to do this using only TfidfVectorizer.
Let me know if there is a way to do this (any information is welcome).
By default, TfidfVectorizer applies l2 normalization to each document row after multiplying the tf and idf. Hence the term frequency cannot be recovered when norm='l2'. Refer here and here.
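A small sketch of why the normalization loses the counts, using a two-document toy corpus that is my own and not from the question: the second document repeats every word of the first twice, yet with the default norm='l2' both rows come out identical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (an assumption for illustration): same words, different counts
docs = ['apple banana', 'apple banana apple banana']

# Default settings, i.e. norm='l2'
X = TfidfVectorizer().fit_transform(docs).toarray()

# Every row is scaled to unit length, so the per-document scale is gone
print(np.linalg.norm(X, axis=1))
# The two rows are indistinguishable even though the raw counts differ
print(np.allclose(X[0], X[1]))
```

Since each row is divided by its own norm, and that norm is not stored anywhere on the vectorizer, the division by idf alone cannot bring the counts back.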
If you can work without norm, then there is a solution.
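import numpy as np
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

# norm=None keeps every entry as the plain product tf * idf
vectorizer = TfidfVectorizer(norm=None)
X = vectorizer.fit_transform(corpus)
# On scikit-learn older than 1.0, use get_feature_names() instead
features = vectorizer.get_feature_names_out()
n = len(features)
# Diagonal matrix holding 1/idf for each term
inverse_idf = sp.diags(1 / vectorizer.idf_,
                       offsets=0,
                       shape=(n, n),
                       format='csr',
                       dtype=np.float64)
# Multiplying by it divides each column by its idf, leaving the term counts
pd.DataFrame((X @ inverse_idf).toarray(),
             columns=features)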