Here's the raw data:
Here's about the first half of the data after reading it into a pandas dataframe:
I'm trying to run TfidfVectorizer, but I keep getting the following error:
ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.
I saw this post that said the error occurs when the max_df value is less than the min_df value in TfidfVectorizer. I have tried several variations where my max_df value is greater than my min_df value and still get the same error. So I think the error might be related to how my data is stored in the pandas dataframe. Am I on the right track? If so, how do I find out how many documents I have in my dataframe? If not, how can I troubleshoot this?
Here's my code:
tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=0, stop_words=None)
tfidf = tfidf_vectorizer.fit_transform(df)
Also, here is the example I am working off of:
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
max_features=n_features,
stop_words='english')
In the above example, the min_df is greater than the max_df. I tried doing that exactly but got the following error:
ValueError: max_df corresponds to < documents than min_df
You should pass a single column of text to the fit_transform function, not the whole dataframe. Here is an example:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
words = ['trust inten other','feel comfort express view']
df = pd.DataFrame(words,columns = ['words'])
tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=0, stop_words=None)
# right
tfidf = tfidf_vectorizer.fit_transform(df['words'])
# wrong: this passes the whole dataframe instead of the text column
# tfidf = tfidf_vectorizer.fit_transform(df)
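When you pass df to the fit_transform function, the vectorizer iterates over the dataframe, and iterating over a dataframe yields its column labels rather than its rows. So it takes ['words'] as input instead of ['trust inten other','feel comfort express view'], which is what the example intends. With only that one "document", pruning by max_df removes every term, hence the error.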
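To check this on your own data, here is a minimal sketch that reuses the toy dataframe from the example above. It shows what the vectorizer sees in each case and how to count the documents with len(df). The variable names seen_from_df, seen_from_column and n_docs are made up for illustration, and the vectorizer is left on its default min_df and max_df so that no pruning happens.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

words = ['trust inten other', 'feel comfort express view']
df = pd.DataFrame(words, columns=['words'])

# Iterating over a DataFrame yields its column labels, not its rows,
# so this is what fit_transform(df) would treat as the documents.
seen_from_df = list(df)

# Iterating over a single column yields the actual text of each row.
seen_from_column = list(df['words'])

# The number of documents is the number of rows.
n_docs = len(df)

print(seen_from_df)      # column labels only
print(seen_from_column)  # the real documents
print(n_docs)

# Default min_df / max_df keep every term, which is handy while debugging.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(df['words'])
print(tfidf.shape)       # (number of documents, number of distinct terms)
```

On the second error in the question: an integer min_df or max_df is an absolute document count, while a float is a proportion of the documents. So in the example you are working from, min_df=2 means "at least 2 documents" and max_df=0.95 means "at most 95% of the documents", and min_df is not actually larger than max_df there. When the vectorizer only sees one document, 0.95 of the corpus is fewer than 2 documents, which would explain the "max_df corresponds to < documents than min_df" message.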