
Understanding TfidfVectorizer output


I'm testing TfidfVectorizer with a simple example, and I can't figure out the results.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)

print(vect.get_feature_names())    
print(tfidf.shape)
print(tfidf)

output:

['apple', 'away', 'blue', 'compare', 'day', 'docs', 'doctor', 'keeps', 'learn', 'like', 'orange', 'prefer', 'scikit']
(5, 13)
  (0, 0)    0.5564505207186616
  (0, 9)    0.830880748357988
  ...

I'm calculating the tfidf of the first sentence and I'm getting different results:

  • The first document ("I'd like an apple") contains just 2 terms after removing stop words; according to the output of vect.get_feature_names(), we are left with "apple" and "like".
  • TF("apple", Document_1) = 1/2 = 0.5
  • TF("like", Document_1) = 1/2 = 0.5
  • The word "apple" appears in 3 of the 5 documents.
  • The word "like" appears in 1 of the 5 documents.
  • IDF ("apple") = ln(5/3) = 0.51082
  • IDF ("like") = ln(5/1) = 1.60943

so:

  • tfidf("apple") in document1 = 0.5 * 0.51082 = 0.255 != 0.5564
  • tfidf("like") in document1 = 0.5 * 1.60943 = 0.804 != 0.8308

What am I missing?


Solution

There are several issues with your calculations.

First, there are multiple conventions for calculating TF (see the Wikipedia entry); scikit-learn does not normalize it by the document length. From the user guide:

    [...] the term frequency, the number of times a term occurs in a given document [...]

So here, TF("apple", Document_1) = 1, not 0.5.
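
As a quick sanity check (a sketch, assuming the corpus and imports from the question), running CountVectorizer with the same settings shows the raw counts that scikit-learn uses as TF:

    from sklearn.feature_extraction.text import CountVectorizer

    # Raw term counts -- this is scikit-learn's notion of TF
    count_vect = CountVectorizer(min_df=1, stop_words="english")
    counts = count_vect.fit_transform(corpus)
    print(counts[0].toarray())
    # [[1 0 0 0 0 0 0 0 0 1 0 0 0]] -> "apple" and "like" each occur once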

Second, regarding the IDF definition, from the docs:

    If smooth_idf=True (the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.

So here we have:

    IDF ("apple") = ln(5+1/3+1) + 1 = 1.4054651081081644
    

hence:

    TF-IDF("apple") = 1 * 1.4054651081081644 =  1.4054651081081644
    

Third, with the default setting norm='l2', there is an extra normalization taking place; from the docs again:

    Normalization is “c” (cosine) when norm='l2', “n” (none) when norm=None.

Explicitly removing this extra normalization from your example, i.e.

    vect = TfidfVectorizer(min_df=1, stop_words="english", norm=None)
    

gives for 'apple':

    (0, 0)  1.4054651081081644
    

i.e. the value already calculated manually above.
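
As a further cross-check (assuming the same corpus), the fitted idf weights are exposed through the idf_ attribute; since every term in the first document has TF = 1, the unnormalized tf-idf values coincide with them:

    vect_raw = TfidfVectorizer(min_df=1, stop_words="english", norm=None)
    tfidf_raw = vect_raw.fit_transform(corpus)
    print(tfidf_raw[0, 0])   # 1.4054651081081644 ('apple')
    print(vect_raw.idf_[0])  # 1.4054651081081644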

For the details of how exactly the normalization affects the calculations when norm='l2' (the default setting), see the Tf–idf term weighting section of the user guide; by their own admission:

    the tf-idfs computed in scikit-learn’s TfidfTransformer and TfidfVectorizer differ slightly from the standard textbook notation
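
To close the loop on the numbers in the question, here is the whole computation by hand (a sketch, assuming the corpus above; "like" occurs in 1 of the 5 documents). Dividing the two raw tf-idf values of the first document by the Euclidean norm of its vector reproduces the reported output:

    import numpy as np

    # Unnormalized tf-idf values for document 0 (TF = 1 for both terms)
    tfidf_apple = np.log((1 + 5) / (1 + 3)) + 1  # ln(1.5) + 1
    tfidf_like = np.log((1 + 5) / (1 + 1)) + 1   # ln(3) + 1

    # norm='l2': divide by the Euclidean norm of the document vector
    l2 = np.sqrt(tfidf_apple ** 2 + tfidf_like ** 2)
    print(tfidf_apple / l2)  # ~0.5564505207186616 -> matches (0, 0)
    print(tfidf_like / l2)   # ~0.830880748357988  -> matches (0, 9)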