
Is the TF-IDF generated by sklearn's TfidfVectorizer incorrect?


This is my code:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

textRaw = [
    "good boy girl",
    "good good good",
    "good boy",
    "good girl",
    "good bad girl",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(textRaw)          # tf-idf matrix, one row per document
allWords = vectorizer.get_feature_names_out()
dense = X.todense()
XList = dense.tolist()
df = pd.DataFrame(XList, columns=allWords)
dictionary = df.T.sum(axis=1)                  # sum each term's weight over all documents

print(dictionary)

Output:

bad     0.772536
boy     1.561542
girl    1.913661
good    2.870128

However, good appears in every document in the corpus, so its idf should be 0, which means its tf-idf should also be 0. Why is the tf-idf value of good computed by TfidfVectorizer the highest of all terms?


Solution

  • From the sklearn documentation:

    The formula that is used to compute the tf-idf for a term t of a document d in a document set is

        tf-idf(t, d) = tf(t, d) * idf(t),

    and the idf is computed as

        idf(t) = log [ n / df(t) ] + 1    (if smooth_idf=False),

    where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t. The effect of adding “1” to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored. (Note that the idf formula above differs from the standard textbook notation that defines the idf as idf(t) = log [ n / (df(t) + 1) ].)
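
In the question's code, smooth_idf is left at its default of True, so sklearn actually computes idf(t) = ln[ (1 + n) / (1 + df(t)) ] + 1 and then L2-normalizes each document's vector (norm="l2" is also a default). For good, df(t) = n = 5, so idf = ln(6/6) + 1 = 1, not 0: the added "+1" keeps it from being ignored. Here is a minimal sketch (the variable names are mine) that recomputes the smoothed idf by hand and checks it against the fitted vectorizer's idf_ attribute:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "good boy girl",
    "good good good",
    "good boy",
    "good girl",
    "good bad girl",
]

vec = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
X = vec.fit_transform(corpus)

n = X.shape[0]                                          # 5 documents
df = np.bincount(X.nonzero()[1], minlength=X.shape[1])  # document frequency of each term
manual_idf = np.log((1 + n) / (1 + df)) + 1             # smoothed idf: ln[(1+n)/(1+df)] + 1

print(vec.get_feature_names_out())  # ['bad' 'boy' 'girl' 'good']
print(vec.idf_)                     # approx. [2.0986 1.6931 1.4055 1.0]
print(manual_idf)                   # identical to vec.idf_

Multiplying the raw term counts by this idf and L2-normalizing each row reproduces the question's column sums exactly, which confirms that these defaults are what produced the output above (again a sketch, reusing corpus and manual_idf from the previous snippet):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

counts = CountVectorizer().fit_transform(corpus).toarray()  # raw tf, same alphabetical vocabulary
tfidf = normalize(counts * manual_idf, norm="l2")           # row-wise L2 normalization
print(tfidf.sum(axis=0))  # approx. [0.7725 1.5615 1.9137 2.8701], matching the question's output

good ends up with the highest column sum because its idf is 1 rather than 0 and it occurs in every document, including "good good good", where it is the only term and therefore receives the full normalized weight of 1.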