I'm testing TfidfVectorizer with a simple example, and I can't figure out the results.
corpus = ["I'd like an apple",
"An apple a day keeps the doctor away",
"Never compare an apple to an orange",
"I prefer scikit-learn to Orange",
"The scikit-learn docs are Orange and Blue"]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)
print(vect.get_feature_names())
print(tfidf.shape)
print(tfidf)
output:
['apple', 'away', 'blue', 'compare', 'day', 'docs', 'doctor', 'keeps', 'learn', 'like', 'orange', 'prefer', 'scikit']
(5, 13)
(0, 0) 0.5564505207186616
(0, 9) 0.830880748357988
...
I'm calculating the tf-idf of the first sentence manually and I'm getting different results:
"I'd like an apple" contains just 2 words after removing stop words (according to the print of vect.get_feature_names(), we are left with "like" and "apple").
"apple" appears 3 times in the corpus.
"like" appears 1 time in the corpus.
So:
tfidf("apple") in document 1 = 0.5 * 0.51082 = 0.255 != 0.5564
tfidf("like") in document 1 = 0.5 * 1.60943 = 0.804 != 0.8308
What am I missing?
There are several issues with your calculations.
First, there are multiple conventions on how to calculate TF (see the Wikipedia entry); scikit-learn does not normalize it with the document length. From the user guide:
[...] the term frequency, the number of times a term occurs in a given document [...]
So, here, TF("apple", Document_1) = 1, and not 0.5.
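One way to see the raw counts scikit-learn works from is to run CountVectorizer with the same settings on the same corpus; this is a quick sketch, not part of the original question:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

count_vect = CountVectorizer(min_df=1, stop_words="english")
counts = count_vect.fit_transform(corpus)

# Row 0 holds the raw term counts for document 1:
# "apple" and "like" each occur once - no division by document length.
print(counts.toarray()[0])
```

The first row contains a 1 in the "apple" column and a 1 in the "like" column, confirming that TF here is the plain occurrence count.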
Second, regarding the IDF definition - from the docs:
If smooth_idf=True (the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.
So, here we will have
IDF("apple") = ln((5+1)/(3+1)) + 1 = ln(1.5) + 1 = 1.4054651081081644
hence
TF-IDF("apple") = 1 * 1.4054651081081644 = 1.4054651081081644
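As a sanity check, the manual formula can be compared against the idf values the fitted vectorizer actually stores (vect.idf_ and vect.vocabulary_ are standard fitted attributes of TfidfVectorizer); a small sketch reusing the corpus above:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

vect = TfidfVectorizer(min_df=1, stop_words="english")
vect.fit(corpus)

n = len(corpus)   # 5 documents
df_apple = 3      # "apple" occurs in 3 of them
idf_apple = np.log((1 + n) / (1 + df_apple)) + 1  # smoothed idf

# Compare against the value scikit-learn computed for "apple"
print(idf_apple)
print(vect.idf_[vect.vocabulary_["apple"]])
```

Both prints give 1.4054651081081644.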
Third, with the default setting norm='l2', there is an extra normalization taking place; from the docs again:
Normalization is “c” (cosine) when norm='l2', “n” (none) when norm=None.
Explicitly removing this extra normalization from your example, i.e.
vect = TfidfVectorizer(min_df=1, stop_words="english", norm=None)
gives for 'apple'
(0, 0) 1.4054651081081644
i.e., as already calculated manually.
For the details of how exactly the normalization affects the calculations when norm='l2'
(the default setting), see the Tf–idf term weighting section of the user guide; by their own admission:
the tf-idfs computed in scikit-learn’s TfidfTransformer and TfidfVectorizer differ slightly from the standard textbook notation
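To close the loop, the normalization itself can be reproduced by hand: take the unnormalized tf-idf matrix (norm=None), divide each row by its Euclidean norm, and the result matches scikit-learn's default output, including the 0.5564 and 0.8308 values from the question. A sketch:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

# Unnormalized tf-idf, then manual l2 normalization row by row
vec_raw = TfidfVectorizer(min_df=1, stop_words="english", norm=None)
raw = vec_raw.fit_transform(corpus).toarray()
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Default vectorizer (norm='l2') for comparison
vec_l2 = TfidfVectorizer(min_df=1, stop_words="english")
default = vec_l2.fit_transform(corpus).toarray()

print(np.allclose(manual, default))  # True
print(manual[0, vec_raw.vocabulary_["apple"]])  # 0.5564..., as in the question
```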