
How to extract calculations using tf-idf

I used TfidfVectorizer to compute TF-IDF, but I don't know how it arrives at its results. When I calculate the values manually, I get a different answer, so I want to extract the values that the function computes in order to learn how it works.

from sklearn.feature_extraction.text import TfidfVectorizer

data = ['Souvenir shop|Architecture and art|Culture and history', 'Souvenir shop|Resort|Diverse cuisine|Fishing|Folk games|Beautiful scenery', 'Diverse cuisine|Resort|Beautiful scenery']

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(data)


  • Have a look at the Attributes section of the scikit-learn documentation for TfidfVectorizer.

    Try this to see which column index each term is mapped to:

    print(vectorizer.vocabulary_)

    {'souvenir': 14,
     'shop': 13,
     'architecture': 1,
     'and': 0,
     'art': 2,
     'culture': 5,
     'history': 10,
     'resort': 11,
     'diverse': 6,
     'cuisine': 4,
     'fishing': 7,
     'folk': 8,
     'games': 9,
     'beautiful': 3,
     'scenery': 12}

    You get the IDF weights with print(vectorizer.idf_); they are ordered by the column indices shown above:


    array([1.69314718, 1.69314718, 1.69314718, 1.28768207, 1.28768207,
           1.69314718, 1.28768207, 1.69314718, 1.69314718, 1.69314718,
           1.69314718, 1.28768207, 1.28768207, 1.28768207, 1.28768207])

    For your case you can do this (with pandas):

    import pandas as pd
    df_idf = pd.DataFrame(
        vectorizer.idf_, index=vectorizer.get_feature_names_out(), columns=["idf_weights"]
    )

    and          1.693147
    architecture 1.693147
    art          1.693147
    beautiful    1.287682
    cuisine      1.287682
    culture      1.693147
    diverse      1.287682
    fishing      1.693147
    folk         1.693147
    games        1.693147
    history      1.693147
    resort       1.287682
    scenery      1.287682
    shop         1.287682
    souvenir     1.287682
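
A likely reason a hand calculation disagrees with these numbers is that, with default settings (smooth_idf=True, norm='l2'), scikit-learn documents the IDF as ln((1 + n) / (1 + df)) + 1 rather than the textbook log(n / df), and then scales every document row to unit Euclidean length. The sketch below redoes both steps with the standard library only, so each intermediate value can be inspected. It assumes the default settings; the names tokenize, tfidf_row, and TOKEN_PATTERN are made up for this illustration and are not part of scikit-learn.

```python
import math
import re
from collections import Counter

data = [
    "Souvenir shop|Architecture and art|Culture and history",
    "Souvenir shop|Resort|Diverse cuisine|Fishing|Folk games|Beautiful scenery",
    "Diverse cuisine|Resort|Beautiful scenery",
]

# Assumption: mirrors TfidfVectorizer's default tokenization
# (lowercase, tokens of two or more word characters).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Hypothetical helper: split one document into lowercase tokens."""
    return TOKEN_PATTERN.findall(doc.lower())


docs = [tokenize(d) for d in data]
n_docs = len(docs)

# Alphabetical vocabulary, matching the order of the idf table above.
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

# Smoothed idf, assuming smooth_idf=True: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_row(doc):
    """Hypothetical helper: raw count * idf, then L2-normalize the row."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


tfidf = [tfidf_row(doc) for doc in docs]

# Print the idf weights next to their terms for comparison with vectorizer.idf_.
for term in vocab:
    print(f"{term:<13}{idf[term]:.6f}")
```

In the third document each of the five terms occurs once and has the same idf, so every nonzero entry in that row works out to 1/sqrt(5), about 0.447. If the vectorizer was created with its defaults, that row should line up with tfidf_matrix.toarray()[2].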