Tags: python, scikit-learn, sklearn-pandas, topic-modeling, tfidfvectorizer

Can I input a pandas dataframe into "TfidfVectorizer"? If so, how do I find out how many documents are in my dataframe?


Here's the raw data: [image: raw data]

Here's about the first half of the data after reading it into a pandas dataframe: [image: pandas dataframe]

I'm trying to run TfidfVectorizer but I keep getting the following error:

ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.

I saw this post that said the error occurs when the max_df value is less than the min_df value in TfidfVectorizer. I have tried several variations where my max_df value is greater than my min_df value and still get the same error. So, I think the error might be related to how my data is stored in the pandas dataframe. Am I on the right track? If so, how do I find out how many documents I have in my dataframe? If not, how can I troubleshoot this?

Here's my code:

tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=0, stop_words=None)
tfidf = tfidf_vectorizer.fit_transform(df)

Also, here is the example I am working off of:

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')

In the above example, the min_df is greater than the max_df. I tried doing that exactly but got the following error:

ValueError: max_df corresponds to < documents than min_df

Solution

  • Pass a single column of text (a Series), not the whole DataFrame, to the fit_transform function. Here is an example:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    words = ['trust inten other','feel comfort express view']
    df = pd.DataFrame(words,columns = ['words'])
    tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=0, stop_words=None)
    # right
    tfidf = tfidf_vectorizer.fit_transform(df['words'])
    # wrong
    # tfidf = tfidf_vectorizer.fit_transform(df)
    

    When you pass df to the fit_transform function, the vectorizer iterates over the DataFrame, which yields its column labels. It therefore receives ['words'] as input instead of ['trust inten other', 'feel comfort express view'], as shown in the example.
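
To answer the "how many documents" part of the question, here is a minimal sketch that reuses the toy data from the example above (the column name `words` and the variable names `n_docs`, `n_terms`, and `raised` are illustrative assumptions, not part of the original post). It checks what the vectorizer sees when it iterates over the DataFrame, counts the documents in the column, and confirms the count from the shape of the resulting matrix. It uses `min_df=1` rather than `min_df=0`, on the assumption that some recent scikit-learn releases reject an integer `min_df` of 0.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data from the answer above; the column name 'words' is illustrative.
words = ['trust inten other', 'feel comfort express view']
df = pd.DataFrame(words, columns=['words'])

# Iterating over a DataFrame yields its column labels, not its rows,
# so the vectorizer would see a single "document": the string 'words'.
seen_by_vectorizer = list(df)

# The number of documents is the number of rows in the text column.
n_rows = len(df['words'])

# Fitting on the column gives a matrix with one row per document.
tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=1, stop_words=None)
tfidf = tfidf_vectorizer.fit_transform(df['words'])
n_docs, n_terms = tfidf.shape

# Fitting on the whole DataFrame fails, because the only term ('words')
# appears in 100% of the one document the vectorizer sees.
raised = False
try:
    TfidfVectorizer(max_df=0.5, min_df=1).fit_transform(df)
except ValueError:
    raised = True

print(seen_by_vectorizer, n_rows, n_docs, n_terms, raised)
```

As for the example being worked from, a float `max_df` or `min_df` is read as a proportion of documents, while an integer is read as an absolute document count. So `max_df=0.95, min_df=2` means "ignore terms in more than 95% of documents or in fewer than 2 documents", and `min_df` is not larger than `max_df` as long as the corpus has enough documents.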