I have been running the TF-IDF Vectorizer from SKLearn but am having trouble recreating the values manually (as an aid to understanding what is happening).
To add some context, I have a list of documents that I have extracted named entities from (in my actual data these go up to 5-grams, but here I have restricted this to bigrams). I only want to know the TF-IDF scores for these values and thought passing these terms via the vocabulary parameter would do this.
Here is some dummy data similar to what I am working with:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
# list of named entities I want to generate TF-IDF scores for
named_ents = ['boston','america','france','paris','san francisco']
# my list of documents
docs = ['i have never been to boston',
'boston is in america',
'paris is the capitol city of france',
'this sentence has no named entities included',
'i have been to san francisco and paris']
# find the max nGram in the named entity vocabulary
ne_vocab_split = [len(word.split()) for word in named_ents]
max_ngram = max(ne_vocab_split)
tfidf = TfidfVectorizer(vocabulary = named_ents, stop_words = None, ngram_range=(1,max_ngram))
tfidf_vector = tfidf.fit_transform(docs)
output = pd.DataFrame(tfidf_vector.T.todense(), index=named_ents, columns=docs)
Note: I know stop-words are removed by default, but some of the named entities in my actual data-set include phrases such as 'the state department'. So they have been kept here.
Here is where I need some help. I'm of the understanding that we calculate the TF-IDF as follows:
TF: term frequency, which according to the SKlearn guidelines is "the number of times a term occurs in a given document".
IDF: inverse document frequency: the natural log of the ratio of 1 + the number of documents to 1 + the number of documents containing the term. According to the same guidelines in the link, the resultant value has a 1 added to prevent division by zero.
We then multiply the TF by the IDF to give the overall TF-IDF for a given term in a given document.
Example
Let's take the first column as an example, which has only one named entity 'Boston', and according to the above code has a TF-IDF on the first document of 1. However, when I work this out manually I get the following:
TF = 1
IDF = log-e((1 + total docs) / (1 + docs with 'boston')) + 1
    = log-e((1 + 5) / (1 + 2)) + 1
    = log-e(6 / 3) + 1
    = log-e(2) + 1
    = 0.69314 + 1
    = 1.69314
TF-IDF = 1 * 1.69314 = 1.69314 (not 1)
Perhaps I'm missing something in the documentation that says scores are capped at 1, but I cannot work out where I've gone wrong. Furthermore, with the above calculations, there shouldn't be any difference between the score for Boston in the first column, and the second column, as the term only appears once in each document.
Edit
After posting the question I thought that maybe the term frequency was calculated as a ratio with either the number of unigrams in the document or the number of named entities in the document. For example, in the second document SKlearn generates a score for Boston of 0.627914. If I calculate the TF as a ratio of tokens = 'boston' (1) : all unigram tokens (4) I get a TF of 0.25, which when I apply to the TF-IDF returns a score just over 0.147.
Similarly, when I use a ratio of tokens = 'boston' (1) : all NE tokens (2) and apply the TF-IDF I get a score of 0.846. So clearly I am going wrong somewhere.
Let's do this mathematical exercise one step at a time.
Step 1. Get tfidf scores for the boston token
docs = ['i have never been to boston',
'boston is in america',
'paris is the capitol city of france',
'this sentence has no named entities included',
'i have been to san francisco and paris']
from sklearn.feature_extraction.text import TfidfVectorizer
# named_ents is not passed here; the vectorizer builds the full vocabulary instead
tfidf = TfidfVectorizer(smooth_idf=True,norm='l1')
Note the params in TfidfVectorizer: they are important for smoothing and normalization later.
docs_tfidf = tfidf.fit_transform(docs).todense()
n = tfidf.vocabulary_["boston"]
docs_tfidf[:,n]
matrix([[0.19085885],
[0.22326669],
[0. ],
[0. ],
[0. ]])
What we've got so far: tfidf scores for the boston token (#3 in vocab).
Step 2. Calculate tfidf for the boston token w/o norm.
The formulae are:
tf-idf(t, d) = tf(t, d) * idf(t)
idf(t) = log( (n+1) / (df(t)+1) ) + 1
where:
- tf(t,d) -- simple frequency of term t in document d
- idf(t) -- smoothed inverse document frequency (because of the smooth_idf=True param)
Counting the token boston in the 0th document and the number of documents it appears in:
import numpy as np
tfidf_boston_wo_norm = ((1/5) * (np.log((1+5)/(1+2))+1))
tfidf_boston_wo_norm
0.3386294361119891
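Note that i does not count as a token under the built-in tokenization scheme (the default token pattern keeps only tokens of two or more characters), so document 0 has 5 tokens. The 1/5 factor is a relative frequency; scikit-learn itself uses the raw count, but a constant per-document factor cancels out in the normalization of Step 3.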
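To make the idf part of Step 2 easy to check in isolation, here is a minimal sketch in plain Python. The helper name smoothed_idf is hypothetical (it is not part of scikit-learn); it only restates the smoothed formula given above.

```python
import math

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """Hypothetical helper: idf(t) = ln((n + 1) / (df(t) + 1)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# 'boston' appears in 2 of the 5 documents; 'never' appears in only 1.
idf_boston = smoothed_idf(5, 2)
idf_never = smoothed_idf(5, 1)

print(round(idf_boston, 5))  # 1.69315
print(round(idf_never, 5))   # 2.09861
```

The rarer token gets the larger idf, which is why never contributes the largest weight in document 0.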
Step 3. Normalization
Let's do l1 normalization first, i.e. all calculated non-normalized tfidf values should sum to 1 by row:
l1_norm = ((1/5) * (np.log((1+5)/(1+2))+1) +
(1/5) * (np.log((1+5)/(1+1))+1) +
(1/5) * (np.log((1+5)/(1+2))+1) +
(1/5) * (np.log((1+5)/(1+2))+1) +
(1/5) * (np.log((1+5)/(1+2))+1))
tfidf_boston_w_l1_norm = tfidf_boston_wo_norm/l1_norm
tfidf_boston_w_l1_norm
0.19085884520912985
As you see, we are getting the same tfidf score as above.
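The same l1 calculation can be sketched for every document without hard-coding the document frequencies. This is only an illustration under stated assumptions: the regex approximates the default tokenizer (lowercased tokens of two or more word characters), tf is the raw count, and the function names tokenize and tfidf_l1_rows are made up for this example.

```python
import math
import re

docs = ['i have never been to boston',
        'boston is in america',
        'paris is the capitol city of france',
        'this sentence has no named entities included',
        'i have been to san francisco and paris']

def tokenize(doc):
    # Assumption: mimics the default pattern, which drops 1-character tokens like 'i'.
    return re.findall(r"\b\w\w+\b", doc.lower())

def tfidf_l1_rows(docs):
    token_lists = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each token.
    df = {}
    for tokens in token_lists:
        for t in set(tokens):
            df[t] = df.get(t, 0) + 1
    rows = []
    for tokens in token_lists:
        # Raw count tf multiplied by the smoothed idf.
        weights = {}
        for t in tokens:
            weights[t] = weights.get(t, 0) + (math.log((1 + n) / (1 + df[t])) + 1)
        total = sum(weights.values())
        # l1 normalization: each row sums to 1.
        rows.append({t: w / total for t, w in weights.items()})
    return rows

rows = tfidf_l1_rows(docs)
print(round(rows[0]['boston'], 6))  # 0.190859
print(round(rows[1]['boston'], 6))  # 0.223267
```

Both values agree with the first two entries of the boston column produced in Step 1.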
Let's now do the same math for the l2 norm.
Benchmark:
tfidf = TfidfVectorizer(sublinear_tf=True,norm='l2')
docs_tfidf = tfidf.fit_transform(docs).todense()
docs_tfidf[:,n]
matrix([[0.42500138],
[0.44400208],
[0. ],
[0. ],
[0. ]])
Calculation:
l2_norm = np.sqrt(((1/5) * (np.log((1+5)/(1+2))+1))**2 +
((1/5) * (np.log((1+5)/(1+1))+1))**2 +
((1/5) * (np.log((1+5)/(1+2))+1))**2 +
((1/5) * (np.log((1+5)/(1+2))+1))**2 +
((1/5) * (np.log((1+5)/(1+2))+1))**2
)
tfidf_boston_w_l2_norm = tfidf_boston_wo_norm/l2_norm
tfidf_boston_w_l2_norm
0.42500137513291814
It's still the same, as you may see.
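Applying the same reasoning to the restricted vocabulary from the question explains the two values that looked wrong there. The sketch below is plain Python rather than scikit-learn itself, and it assumes the defaults norm='l2' and smooth_idf=True, a regex that approximates the default tokenizer, and raw-count tf; count_term and l2_row are hypothetical helper names.

```python
import math
import re

named_ents = ['boston', 'america', 'france', 'paris', 'san francisco']
docs = ['i have never been to boston',
        'boston is in america',
        'paris is the capitol city of france',
        'this sentence has no named entities included',
        'i have been to san francisco and paris']

def tokenize(doc):
    # Assumption: approximates the default token pattern (2+ word characters).
    return re.findall(r"\b\w\w+\b", doc.lower())

def count_term(tokens, term):
    # Count a (possibly multi-word) term as a contiguous token n-gram.
    parts = term.split()
    k = len(parts)
    return sum(tokens[i:i + k] == parts for i in range(len(tokens) - k + 1))

# Count matrix restricted to the named entities only.
counts = [[count_term(tokenize(d), t) for t in named_ents] for d in docs]
n = len(docs)
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(named_ents))]
idf = [math.log((1 + n) / (1 + f)) + 1 for f in df]

def l2_row(row):
    weights = [c * w for c, w in zip(row, idf)]
    length = math.sqrt(sum(w * w for w in weights))
    # A document with no vocabulary terms stays all zeros.
    return [w / length for w in weights] if length else weights

scores = [l2_row(row) for row in counts]
print(round(scores[0][0], 6))  # boston in document 0: 1.0
print(round(scores[1][0], 6))  # boston in document 1: 0.627914
```

In document 0, boston is the only vocabulary term, so the l2-normalized row has a single non-zero entry and it must equal 1. In document 1 the row is shared with america, which has a higher idf, so boston drops to about 0.6279. The scores are not capped at 1; they are rescaled per document.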