My corpus is a series of documents containing Twitter data, and it has been cleaned and pre-processed to the best of my knowledge (even including emoji). Example below:
0 [national, interest, think, worth, holding, ta...
1 [must, accurate, diane, abbott, done, calculat...
I then instantiate TFIDF:
# Instantiate vectoriser
vect = TfidfVectorizer()
# Fit
vect = TfidfVectorizer(min_df=10, ngram_range = (1,3)).fit(text)
When I try to fit this, I get:
AttributeError: 'list' object has no attribute 'lower'
But I've already converted everything to lower case. Is this something to do with the fact that it's a series?
The documentation describes TfidfVectorizer as: "Convert a collection of raw documents to a matrix of TF-IDF features."
You are instead passing a Series in which each element is a list of tokens, as replicated here:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
l1 = 'national, interest, think, worth, holding,'.split(',')
l2 = 'must, accurate, diane, abbott, done'.split(',')
df = pd.DataFrame([[l1],[l2]])
text = df[0]
which returns your text parameter as:
0 [national, interest, think, worth, holding, ]
1 [must, accurate, diane, abbott, done]
Name: 0, dtype: object
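This will not work: TfidfVectorizer expects each document to be a single string, not a list of tokens. With the default lowercase=True it calls .lower() on every document, which is what raises the AttributeError on a list. In your case, join each list back into one string before fitting, even though that feels slightly counter-intuitive after you have already tokenised: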
corpus = text.apply(lambda x: ','.join(x)).to_list() # converts your series into a list of strings
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
['abbott', 'accurate', 'diane', 'done', 'holding', 'interest', 'must', 'national', 'think', 'worth']