Using gensim:
import pandas as pd
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
sent0 = "The quick brown fox jumps over the lazy brown dog .".lower().split()
sent1 = "Mr brown jumps over the lazy fox .".lower().split()
dataset = [sent0, sent1]
vocab = Dictionary(dataset)
corpus = [vocab.doc2bow(sent) for sent in dataset]
model = TfidfModel(corpus)
# To retrieve the same pd.DataFrame format.
documents_tfidf_lol = [{vocab[word_idx]:tfidf_value for word_idx, tfidf_value in sent} for sent in model[corpus]]
documents_tfidf = pd.DataFrame(documents_tfidf_lol)
documents_tfidf.fillna(0, inplace=True)
documents_tfidf
[out]:
        dog   mr     quick
0  0.707107  0.0  0.707107
1  0.000000  1.0  0.000000
If we do the TF-IDF computation manually,
import math
from collections import Counter
import numpy as np
import pandas as pd
sent0 = "The quick brown fox jumps over the lazy brown dog .".lower().split()
sent1 = "Mr brown jumps over the lazy fox .".lower().split()
documents = pd.DataFrame.from_dict(list(map(Counter, [sent0, sent1])))
documents.fillna(0, inplace=True, downcast='infer')
documents = documents.apply(lambda x: x/sum(x)) # Normalize the TF.
documents.head()
# To compute the IDF for all words.
num_sentences, num_words = documents.shape
idf_vector = [] # Let's save an ordered list of IDFs w.r.t. the order of the column names.
for word in documents:
    # Document frequency = number of rows with a nonzero entry for this word.
    word_idf = math.log(num_sentences/len(documents[word].to_numpy().nonzero()[0]))
    idf_vector.append(word_idf)
# Compute the TF-IDF table.
documents_tfidf = pd.DataFrame(documents.to_numpy() * np.array(idf_vector),
                               columns=list(documents))
documents_tfidf
[out]:
     .  brown       dog  fox  jumps  lazy        mr  over     quick  the
0  0.0    0.0  0.693147  0.0    0.0   0.0  0.000000   0.0  0.693147  0.0
1  0.0    0.0  0.000000  0.0    0.0   0.0  0.693147   0.0  0.000000  0.0
If we use math.log2 instead of math.log:
     .  brown  dog  fox  jumps  lazy   mr  over  quick  the
0  0.0    0.0  1.0  0.0    0.0   0.0  0.0   0.0    1.0  0.0
1  0.0    0.0  0.0  0.0    0.0   0.0  1.0   0.0    0.0  0.0
It looks like gensim:
print(model[corpus])
Looking at https://radimrehurek.com/gensim/models/tfidfmodel.html#gensim.models.tfidfmodel.TfidfModel , a different SMART scheme would have produced different values, but the docs do not make clear what the default is.
What is the default smartirs for gensim TfidfModel?
What other default parameters cause the difference between a natively implemented TF-IDF and gensim's?
The default value of smartirs is None, but if you follow the code, it is equivalent to ntc.
But how?
First, when you call model = TfidfModel(corpus), it calculates the IDF of the corpus with a function called wglobal, which the docs explain as follows:
wglobal is the function for global weighting; its default value is df2idf(). df2idf is a function that computes the IDF for a term with the given document frequency. The default arguments and formula for df2idf are:
df2idf(docfreq, totaldocs, log_base=2.0, add=0.0)
which is implemented as:
idfs = add + np.log(float(totaldocs) / docfreq) / np.log(log_base)
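To see what this formula gives for the two-sentence corpus in the question, here is a minimal pure-Python sketch. The helper name df2idf_sketch is made up for illustration and only mirrors the formula quoted above; it is not gensim's own function.

```python
import math

def df2idf_sketch(docfreq, totaldocs, log_base=2.0, add=0.0):
    """Hypothetical stand-in that mirrors the quoted formula with math instead of numpy."""
    return add + math.log(float(totaldocs) / docfreq) / math.log(log_base)

# The corpus in the question has two documents in total.
idf_rare = df2idf_sketch(docfreq=1, totaldocs=2)    # word in one document, e.g. "dog"
idf_common = df2idf_sketch(docfreq=2, totaldocs=2)  # word in both documents, e.g. "brown"
print(idf_rare, idf_common)
```

With base-2 logs, a word found in only one of the two documents gets an IDF of 1, and a word found in both gets 0, which lines up with the manual math.log2 table in the question.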
This determines one of the smartirs letters: the document frequency weighting is inverse document frequency, or idf.
wlocal is the identity function by default. The term frequencies of the corpus are passed through the identity function, which changes nothing, so the corpus itself is returned. Hence another letter of smartirs, the term frequency weighting, is natural, or n. Now that we have the term frequency and the inverse document frequency, we can compute TF-IDF as their product.
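The product of natural term frequency and IDF can be sketched without gensim as follows. This is an illustration under the assumptions stated in the answer (raw counts as TF, base-2 IDF); the variable names are invented here and plain words stand in for gensim's integer term ids.

```python
import math
from collections import Counter

sent0 = "the quick brown fox jumps over the lazy brown dog .".split()
sent1 = "mr brown jumps over the lazy fox .".split()
docs = [sent0, sent1]

# Document frequency: the number of documents that contain each word.
docfreq = Counter(word for doc in docs for word in set(doc))
idf = {word: math.log2(len(docs) / df) for word, df in docfreq.items()}

# Natural term frequency: raw counts, left untouched (the identity function).
raw_tfidf = [
    {word: count * idf[word] for word, count in Counter(doc).items()}
    for doc in docs
]
print(raw_tfidf[0])
```

Only "dog" and "quick" in the first sentence and "mr" in the second end up with a nonzero value, because every other word occurs in both sentences.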
normalize is True by default, which means that after computing TF-IDF the model normalizes the TF-IDF vectors. The normalization uses the l2-norm (Euclidean unit norm), which means the last smartirs letter is cosine, or c. This part is implemented as:
# vec(term_id, value) is tfidf result
length = 1.0 * math.sqrt(sum(val ** 2 for _, val in vec))
normalize_by_length = [(termid, val / length) for termid, val in vec]
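As a worked example of that normalization, take the first sentence, where (under the assumptions above) only "dog" and "quick" have a nonzero TF-IDF of 1.0 each. The sketch below uses words in place of gensim's integer term ids.

```python
import math

# Unnormalized (term, tfidf) pairs for the first sentence; the values are assumed
# from the natural-TF times base-2-IDF computation described earlier.
vec = [("dog", 1.0), ("quick", 1.0)]

length = math.sqrt(sum(val ** 2 for _, val in vec))       # Euclidean length, sqrt(2)
unit_vec = [(term, val / length) for term, val in vec]    # scale to unit length
print([(term, round(val, 6)) for term, val in unit_vec])
```

Each value becomes 1/sqrt(2), roughly 0.707107, which is the number gensim reports for "dog" and "quick" in the question. The second sentence has a single nonzero entry, so normalizing it gives 1.0 for "mr".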
When you call model[corpus] or model.__getitem__(), the following things happen:
__getitem__ has an eps argument, a threshold that removes every entry whose TF-IDF value is smaller than eps. By default, this value is 1e-12. Words that occur in both documents have an IDF of log2(2/2) = 0, so their TF-IDF is 0 and they are dropped. As a result, when you print the vectors, only some of the entries appear.
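The effect of the threshold can be sketched in a few lines. This only illustrates the filtering idea with made-up variable names; it does not reproduce gensim's internals or the exact point at which gensim applies the filter.

```python
EPS = 1e-12  # the default threshold quoted above

# Raw TF-IDF values for the first sentence, assumed from natural TF and base-2 IDF.
vec = [
    (".", 0.0), ("brown", 0.0), ("dog", 1.0), ("fox", 0.0), ("jumps", 0.0),
    ("lazy", 0.0), ("over", 0.0), ("quick", 1.0), ("the", 0.0),
]

# Keep only the entries whose magnitude exceeds the threshold.
kept = [(term, val) for term, val in vec if abs(val) > EPS]
print(kept)
```

Seven of the nine entries disappear, which is why the gensim DataFrame in the question has columns only for "dog", "mr" and "quick" while the manual table keeps every word.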