python nlp gensim information-retrieval tf-idf

What is the default smartirs for gensim TfidfModel?

Using gensim:

import pandas as pd
from gensim.models import TfidfModel
from gensim.corpora import Dictionary

sent0 = "The quick brown fox jumps over the lazy brown dog .".lower().split()
sent1 = "Mr brown jumps over the lazy fox .".lower().split()

dataset = [sent0, sent1]
vocab = Dictionary(dataset)
corpus = [vocab.doc2bow(sent) for sent in dataset] 
model = TfidfModel(corpus)

# To retrieve the same pd.DataFrame format.
documents_tfidf_lol = [{vocab[word_idx]:tfidf_value for word_idx, tfidf_value in sent} for sent in model[corpus]]
documents_tfidf = pd.DataFrame(documents_tfidf_lol)
documents_tfidf.fillna(0, inplace=True)

documents_tfidf

[out]:

        dog   mr     quick
0  0.707107  0.0  0.707107
1  0.000000  1.0  0.000000

If we do the TF-IDF computation manually,

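import math
from collections import Counter

import numpy as np
import pandas as pd

sent0 = "The quick brown fox jumps over the lazy brown dog .".lower().split()
sent1 = "Mr brown jumps over the lazy fox .".lower().split()

# One row per sentence, one column per word; words missing from a sentence become 0.
documents = pd.DataFrame(list(map(Counter, [sent0, sent1]))).fillna(0)
# Normalize the TF. Note: apply() works column-wise by default,
# so this divides each column by its column sum.
documents = documents.apply(lambda x: x/sum(x))
documents.head()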

# To compute the IDF for all words.
num_sentences, num_words = documents.shape

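idf_vector = [] # Save the IDFs in the same order as the column names.

for word in documents:
  # Document frequency = number of sentences with a nonzero count for this word.
  # (Series.nonzero() was removed in pandas 1.0.)
  word_idf = math.log(num_sentences / (documents[word] != 0).sum())
  idf_vector.append(word_idf)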

# Compute the TF-IDF table.
# (DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it.)
documents_tfidf = pd.DataFrame(documents.to_numpy() * np.array(idf_vector),
                               columns=list(documents))
documents_tfidf

[out]:

     .  brown       dog  fox  jumps  lazy        mr  over     quick  the
0  0.0    0.0  0.693147  0.0    0.0   0.0  0.000000   0.0  0.693147  0.0
1  0.0    0.0  0.000000  0.0    0.0   0.0  0.693147   0.0  0.000000  0.0

If we use math.log2 instead of math.log:

     .  brown  dog  fox  jumps  lazy   mr  over  quick  the
0  0.0    0.0  1.0  0.0    0.0   0.0  0.0   0.0    1.0  0.0
1  0.0    0.0  0.0  0.0    0.0   0.0  1.0   0.0    0.0  0.0

It looks like gensim:

removes the non-salient words from the TF-IDF model, which is evident when we print(model[corpus])
maybe uses a log base other than 2
maybe applies some normalization.

Looking at https://radimrehurek.com/gensim/models/tfidfmodel.html#gensim.models.tfidfmodel.TfidfModel , a different SMART scheme would produce different values, but the docs do not make clear what the default is.

What is the default smartirs for gensim TfidfModel?

What other default parameters cause the difference between a manually implemented TF-IDF and gensim's?

Solution

The default value of smartirs is None, but if you follow the code, the default behavior is equivalent to ntc.

But how?

First, when you call model = TfidfModel(corpus), it calculates the IDF of the corpus with a function called wglobal, which the docs explain as follows:

wglobal is the function for global weighting; its default value is df2idf(). df2idf is a function that computes the IDF for a term with the given document frequency. The default arguments and formula for df2idf are:

df2idf(docfreq, totaldocs, log_base=2.0, add=0.0)

which is implemented as:

idfs = add + np.log(float(totaldocs) / docfreq) / np.log(log_base)
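To see what this formula gives on the two-sentence corpus from the question, here is a minimal sketch in plain Python. The name df2idf_sketch is hypothetical; it only mirrors the formula quoted above and is not gensim's own function.

```python
import math

def df2idf_sketch(docfreq, totaldocs, log_base=2.0, add=0.0):
    """Illustrative re-statement of the IDF formula quoted above."""
    return add + math.log(float(totaldocs) / docfreq) / math.log(log_base)

# "dog" occurs in 1 of the 2 sentences, "brown" occurs in both.
idf_dog = df2idf_sketch(docfreq=1, totaldocs=2)
idf_brown = df2idf_sketch(docfreq=2, totaldocs=2)

print(idf_dog, idf_brown)
```

With log base 2, a word found in only one of the two sentences gets an IDF of 1, and a word found in both gets an IDF of 0.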

This determines one letter of smartirs: the document frequency weighting is inverse document frequency, or idf (the t in ntc).

wlocal is the identity function by default. The term frequencies of the corpus are passed through the identity function, which leaves them unchanged, so the corpus itself is returned. Hence the term frequency weighting, another letter of smartirs, is natural, or n. Now that we have the term frequency and the inverse document frequency, we can compute TF-IDF by multiplying them.

normalize is True by default, which means that after computing TF-IDF the model normalizes each TF-IDF vector. The normalization uses the l2-norm (Euclidean unit norm), so the last letter of smartirs is cosine, or c. This part is implemented as:

# vec(term_id, value) is tfidf result
length = 1.0 * math.sqrt(sum(val ** 2 for _, val in vec))
normalize_by_length = [(termid, val / length) for termid, val in vec]
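Putting the three steps together (raw counts, base-2 IDF, l2 normalization) reproduces the numbers from the question. The following is a minimal sketch in plain Python, not gensim's code; tfidf_sketch and the other names are hypothetical.

```python
import math
from collections import Counter

sent0 = "the quick brown fox jumps over the lazy brown dog .".split()
sent1 = "mr brown jumps over the lazy fox .".split()
docs = [sent0, sent1]

# Document frequency: in how many sentences each word occurs.
docfreq = Counter(word for doc in docs for word in set(doc))

def tfidf_sketch(doc, docfreq, totaldocs):
    tf = Counter(doc)  # natural term frequency: raw counts, unchanged
    weights = {w: c * math.log2(totaldocs / docfreq[w]) for w, c in tf.items()}
    length = math.sqrt(sum(v ** 2 for v in weights.values()))
    if length == 0:  # every word occurs in every document
        return weights
    return {w: v / length for w, v in weights.items()}  # cosine (l2) normalization

vec0 = tfidf_sketch(sent0, docfreq, len(docs))
vec1 = tfidf_sketch(sent1, docfreq, len(docs))

print(round(vec0["dog"], 6), round(vec0["quick"], 6), vec1["mr"])
```

Before normalization, sent0 has weight 1 for dog and quick and 0 for everything else, so its length is sqrt(2) and both values become about 0.707107. sent1 has a single nonzero weight (mr), which normalizes to 1.0.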

When you call model[corpus] or model.__getitem__(), the following happens:

__getitem__ has an eps argument, a threshold that removes every entry whose TF-IDF value is below eps. By default this value is 1e-12. Words that occur in both sentences have an IDF of log2(2/2) = 0 and therefore a TF-IDF value of 0, so they are dropped. That is why only dog, mr and quick appear when you print the vectors.
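The thresholding step can be sketched as follows. This is an illustration, not gensim's code: drop_small_entries is a hypothetical name, the term ids are made up, and the sketch assumes that an entry is kept only when its absolute value exceeds eps.

```python
def drop_small_entries(vec, eps=1e-12):
    """Keep only (termid, value) pairs whose absolute value exceeds eps."""
    return [(termid, val) for termid, val in vec if abs(val) > eps]

# Hypothetical normalized vector for the first sentence:
# ids 2 and 4 stand for "dog" and "quick", the rest are words shared by both sentences.
vec = [(0, 0.0), (1, 0.0), (2, 0.7071067811865475), (3, 0.0), (4, 0.7071067811865475)]
kept = drop_small_entries(vec)

print(kept)
```

Only the two nonzero entries survive, which matches the sparse output seen when printing model[corpus].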