scikit-learn statistics tf-idf tfidfvectorizer

What does this output of sklearn's tfidf_vectorizer represent?


First, I fitted the tfidf_vectorizer on my training data.

X_train_counts = tfidf_vectorizer.fit_transform(X_train)

Then I tried to output the tf-idf value for the one-word document 'programming'.

test = tfidf_vectorizer.transform(['programming']).reshape(1, -1)
print(test)

The result is

(0, 45295)  1.0

What does this 1.0 represent? I thought it might be the tf-idf or idf value of the word 'programming', since the tf value in this case is 1.

Then I tried

test = tfidf_vectorizer.transform(['programming upgrade']).reshape(1, -1)
print(test)

The result is as follows.

(0, 60314)  0.7968362696657073
(0, 45295)  0.6041952990095505

If 1.0 were the tf-idf value, then in this case it should be 0.5, since the tf value is 1/2, but that is not what I get.

So what do these numbers represent? They do not seem to be the tf value, the idf value, or the tf-idf value.

Confused


Solution

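  • I think your issue is that the default setting for the tfidf_vectorizer is the norm "l2" instead of "l1".

    The output of the tfidf_vectorizer is the tf-idf matrix, so each number is a tf-idf value, but one that has been normalized per document: every row is rescaled to unit length under the chosen norm. A document with a single known term therefore always gets 1.0, whatever the idf of that term is.

    By default, the tfidf_vectorizer uses the 'l2' norm (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), so the squares of the values in each row sum to 1. That is the case for your second output: 0.7968² + 0.6042² ≈ 1.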

    Here is a side-by-side comparison of the resulting tf-idf values:

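    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus with a single document, so both terms get the same idf
    docs = ['programming upgrade']

    # Same vectorizer, differing only in how each row is normalized
    tfidf_l1 = TfidfVectorizer(norm='l1')
    tfidf_l2 = TfidfVectorizer(norm='l2')

    tfidf_l1.fit(docs)
    tfidf_l2.fit(docs)

    # l1: the absolute values in the row sum to 1
    print("Output tfidf_transformer with l1 norm:")
    test = tfidf_l1.transform(['programming upgrade']).reshape(1, -1)
    print(test)

    # l2 (the default): the squared values in the row sum to 1
    print("Output tfidf_transformer with l2 norm:")
    test = tfidf_l2.transform(['programming upgrade']).reshape(1, -1)
    print(test)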
    

    And this returns:

    Output tfidf_transformer with l1 norm:
      (0, 1)    0.5
      (0, 0)    0.5
    Output tfidf_transformer with l2 norm:
      (0, 1)    0.7071067811865475
      (0, 0)    0.7071067811865475
    

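    So if you want the values of each document to sum to 1, create your TfidfVectorizer with norm="l1". Note that the result is exactly 0.5 and 0.5 here only because both terms have the same idf in this one-document corpus. On your training data the two terms have different idf values (your l2 output already shows different weights), so the l1-normalized values will differ as well.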
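
To make the normalization step concrete, here is a minimal sketch in plain Python that redoes the arithmetic by hand. It does not call scikit-learn. The helper names `l2_normalize` and `l1_normalize` and the raw weight 7.3 are made up for illustration, and the only real data are the two numbers printed in the question.

```python
import math


def l2_normalize(weights):
    """Rescale weights so that the sum of their squares is 1."""
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights]


def l1_normalize(weights):
    """Rescale weights so that the sum of their absolute values is 1."""
    total = sum(abs(w) for w in weights)
    return [w / total for w in weights]


# Values printed in the question for 'programming upgrade'.
reported = [0.7968362696657073, 0.6041952990095505]

# If the row was l2-normalized, the squares must sum to (approximately) 1.
squared_sum = sum(w * w for w in reported)

# A one-term document normalizes to 1.0 for any positive raw weight.
# 7.3 is an arbitrary, hypothetical raw tf-idf weight.
single_l2 = l2_normalize([7.3])
single_l1 = l1_normalize([7.3])

# Normalization only rescales the row, so the l1 result can be derived
# from the l2 output without knowing the raw idf values.
reported_l1 = l1_normalize(reported)

print("sum of squares of reported values:", squared_sum)
print("single-term row, l2 and l1:", single_l2, single_l1)
print("reported row under l1:", reported_l1)
```

The single-term row comes out as 1.0 under either norm, which is why 'programming' alone printed 1.0. Rescaling the reported row to the l1 norm gives roughly 0.57 and 0.43 rather than 0.5 and 0.5, because the two terms carry different idf weights in the training corpus.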