
Calculate TF-IDF using sklearn for n-grams in python


I have a vocabulary list that includes n-grams, as follows.

myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']

I want to use these words to calculate TF-IDF values.

I also have a corpus stored as a dictionary (key = recipe number, value = recipe).

corpus = {
    1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates",
    2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates",
    3: "making chocolates drink different way using fresh milk egg",
}

I am currently using the following code.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())

Now I am printing the tokens (n-grams) of recipe 1 in the corpus along with their TF-IDF values, as follows.

feature_names = tfidf.get_feature_names()
doc = 0
feature_index = tfs[doc,:].nonzero()[1]
tfidf_scores = zip(feature_index, [tfs[doc, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
  print(w, s)

The result I get is chocolates 1.0. However, my code does not detect n-grams (bigrams) such as biscuit pudding when calculating TF-IDF values. Please let me know where my code goes wrong.

I want to get the TF-IDF matrix for myvocabulary terms by using the recipe documents in the corpus. In other words, the rows of the matrix represent myvocabulary and the columns of the matrix represent the recipe documents of my corpus. Please help me.


Solution

  • Try increasing the ngram_range in TfidfVectorizer. By default ngram_range=(1, 1), so the tokenizer only produces unigrams, and bigram vocabulary entries such as biscuit pudding can never match:

    tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english', ngram_range=(1,2))
    

    Edit: The output of fit_transform is the TF-IDF matrix in sparse format (actually the transpose of the matrix you seek: its rows are documents and its columns are vocabulary terms). You can print out its contents e.g. like this:

    feature_names = tfidf.get_feature_names()
    corpus_index = [n for n in corpus]
    rows, cols = tfs.nonzero()
    for row, col in zip(rows, cols):
        print((feature_names[col], corpus_index[row]), tfs[row, col])
    

    which should yield

    ('biscuit pudding', 1) 0.646128915046
    ('chocolates', 1) 0.763228291628
    ('chocolates', 2) 0.508542320378
    ('tim tam', 2) 0.861036995944
    ('chocolates', 3) 0.508542320378
    ('fresh milk', 3) 0.861036995944
    

    If the matrix is not large, it might be easier to examine it in dense form. Pandas makes this very convenient:

    import pandas as pd
    df = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index)
    print(df)
    

    This results in

                            1         2         3
    tim tam          0.000000  0.861037  0.000000
    jam              0.000000  0.000000  0.000000
    fresh milk       0.000000  0.000000  0.861037
    chocolates       0.763228  0.508542  0.508542
    biscuit pudding  0.646129  0.000000  0.000000
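    Putting the pieces together, a complete sketch using the vocabulary and corpus from the question might look like this (indexing the DataFrame directly by myvocabulary, which matches the feature order when an explicit vocabulary list is passed):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']
corpus = {
    1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates",
    2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates",
    3: "making chocolates drink different way using fresh milk egg",
}

# ngram_range=(1, 2) makes the tokenizer emit unigrams AND bigrams,
# so vocabulary entries like 'biscuit pudding' can match.
tfidf = TfidfVectorizer(vocabulary=myvocabulary, stop_words='english', ngram_range=(1, 2))
tfs = tfidf.fit_transform(corpus.values())

# Transpose so rows are vocabulary terms and columns are recipe numbers.
df = pd.DataFrame(tfs.T.todense(), index=myvocabulary, columns=list(corpus))
print(df.round(6))
```

    The printed DataFrame matches the table above, with the all-zero jam row showing that a vocabulary term absent from every document simply gets zero weight.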