Given a corpus of 3 documents, for example:
sentences = ["This car is fast",
"This car is pretty",
"Very fast truck"]
I am calculating tf-idf by hand. For document 1 and the word "car", I find:
TF = 1/4
IDF = log(3/2)
TF-IDF = 1/4 * log(3/2)
The same result should apply to document 2, since it also has 4 words, one of which is "car".
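As a quick check, the same arithmetic in plain Python (using the natural log):
import math

tf = 1 / 4             # "car" appears once among the 4 tokens of document 1
idf = math.log(3 / 2)  # 3 documents in total, 2 of them contain "car"
print(tf * idf)        # ~0.1014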
I have tried to reproduce this in sklearn with the code below:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

data = {'text': sentences}
df = pd.DataFrame(data)

# Fit the vectorizer and print the tf-idf matrix with the vocabulary as columns.
tv = TfidfVectorizer()
tfvector = tv.fit_transform(df.text)
print(pd.DataFrame(tfvector.toarray(), columns=tv.get_feature_names_out()))
And the result I get is:
car fast is pretty this truck very
0 0.500000 0.50000 0.500000 0.000000 0.500000 0.000000 0.000000
1 0.459854 0.00000 0.459854 0.604652 0.459854 0.000000 0.000000
2 0.000000 0.47363 0.000000 0.000000 0.000000 0.622766 0.622766
I understand that sklearn uses L2 normalization, but shouldn't the tf-idf score of "car" still be the same in the first two documents? Can anyone help me understand the results?
It is because of the normalization. If you pass norm=None to the vectorizer, i.e. TfidfVectorizer(norm=None), you get the following result, in which "car" has the same value in both documents:
car fast is pretty this truck very
0 1.287682 1.287682 1.287682 0.000000 1.287682 0.000000 0.000000
1 1.287682 0.000000 1.287682 1.693147 1.287682 0.000000 0.000000
2 0.000000 1.287682 0.000000 0.000000 0.000000 1.693147 1.693147
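With the default norm='l2', each row is divided by its Euclidean norm, so the final score of "car" depends on the other words in the same document, which is why the two values differ. Two further details explain the absolute numbers: sklearn's tf is the raw term count (not count divided by document length), and with the default smooth_idf=True the idf is ln((1 + n) / (1 + df)) + 1, which for "car" gives ln(4/3) + 1 ≈ 1.287682. A minimal sketch verifying both points, using only standard sklearn/numpy calls:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

sentences = ["This car is fast",
             "This car is pretty",
             "Very fast truck"]

# Unnormalized tf-idf: tf is the raw count and, with smooth_idf=True,
# idf = ln((1 + n_docs) / (1 + df)) + 1.
raw = TfidfVectorizer(norm=None).fit_transform(sentences).toarray()
print(np.log(4 / 3) + 1)   # 1.287682..., the idf of "car" (n=3, df=2)

# L2-normalizing each row reproduces the default TfidfVectorizer() output.
print(normalize(raw, norm='l2'))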