Tags: python, nlp, tf-idf, tfidfvectorizer

Error in fit_transform while finding tf-idf in Python


import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
mylist = [
    'a a b c',
    'a c c c d e f',
    'a c d d d',
    'a d f',
]
df = pd.DataFrame({"texts": mylist})
tfidf_vectorizer = TfidfVectorizer(ngram_range=[1, 1])
tfidf_separate = tfidf_vectorizer.fit_transform(df["texts"])

I am trying to find the tf-idf value for “d” in line 3 (the third document, 'a c d d d'), but fit_transform raises an empty-vocabulary error: "ValueError: empty vocabulary; perhaps the documents only contain stop words".

Any advice on how to resolve the error would be appreciated!


Solution

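  • The error occurs because the default analyzer='word' uses the token pattern (?u)\b\w\w+\b, which only keeps tokens of two or more characters. Every token in your documents is a single letter, so all of them are dropped and the vocabulary ends up empty. You can fix it like this:

    • set analyzer='char' so that TfidfVectorizer works with individual characters (note that the space character then counts as a feature too);
    • look up the index of d in vocabulary_ and use it to read the value from row 2 (the third document).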
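
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    mylist = [
        'a a b c',
        'a c c c d e f',
        'a c d d d',
        'a d f',
    ]
    df = pd.DataFrame({"texts": mylist})

    # analyzer='char' makes every character (including spaces) a feature;
    # ngram_range is documented as a tuple, so pass (1, 1) rather than a list
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1), analyzer='char')
    tfidf_separate = tfidf_vectorizer.fit_transform(df["texts"])

    # column index of 'd' in the fitted vocabulary
    ind = tfidf_vectorizer.vocabulary_['d']
    # row 2 is the third document, 'a c d d d'
    print(tfidf_separate[2, ind])  # 0.6490674853546846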
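
If you would rather not count spaces as features, another option is to keep the word analyzer and relax its token pattern so that single-character tokens survive. The sketch below is only an illustration, not part of the original answer: it first shows with re.findall that the default pattern matches nothing in your third document, then fits a vectorizer with token_pattern=r"(?u)\b\w+\b". The variable names (default_pattern, relaxed_pattern, vectorizer, value) are made up for the example, and the list is passed directly instead of going through a DataFrame.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

mylist = [
    'a a b c',
    'a c c c d e f',
    'a c d d d',
    'a d f',
]

# The default word pattern needs at least two word characters per token,
# so it finds no tokens at all in a document made of single letters.
default_pattern = r"(?u)\b\w\w+\b"
default_tokens = re.findall(default_pattern, mylist[2])
print(default_tokens)  # []

# Relaxed pattern: one or more word characters, so 'a', 'c', 'd' are kept.
relaxed_pattern = r"(?u)\b\w+\b"
vectorizer = TfidfVectorizer(ngram_range=(1, 1), token_pattern=relaxed_pattern)
tfidf = vectorizer.fit_transform(mylist)

# Column index of 'd', then the tf-idf value in the third document (row 2).
ind = vectorizer.vocabulary_['d']
value = tfidf[2, ind]
print(sorted(vectorizer.vocabulary_))
print(value)
```

The value you get this way is not the same as with analyzer='char', because the space character no longer contributes to the document's norm. Which one you want depends on whether whitespace should count as a feature in your data.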