python, machine-learning, scikit-learn, sklearn-pandas

Problems converting text input to numeric format with TfidfVectorizer of Sklearn


I'm trying to train a model with Sklearn. In short, I have a Pandas DataFrame with two columns: 'review', which holds the input text, and 'sentiment'. I am having trouble converting the text input to numeric format with Sklearn's TfidfVectorizer.

With the following code:

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.svm import SVC

  tfidf = TfidfVectorizer(stop_words='english')
  train_x_vector = tfidf.fit_transform(train_x)
  test_x_vector = tfidf.transform(test_x)

  svc = SVC(kernel='linear')
  svc.fit(train_x_vector, train_y)

I get the following error:

[Image: screenshot of the error message]

I suspect the problem lies in converting the input to numeric data:

[Image: screenshot of the converted input]

Any suggestions on how to solve it?

Thanks in advance!


Solution

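  • TfidfVectorizer expects an iterable of strings (for example a list or an array), one string per document. In the code, train_x is a DataFrame, and iterating over a DataFrame yields its column labels rather than its rows, so the vectorizer never sees the review texts.

    Solution: pass the text column itself, and do the same for the test set (assuming test_x has the same layout):

    train_x_vector = tfidf.fit_transform(train_x['review'].values)
    test_x_vector = tfidf.transform(test_x['review'].values)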
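
To see why the fix above matters, here is a minimal sketch that reproduces the mismatch and then applies the fix. The toy DataFrame `df` and its four reviews are made up for illustration, and it assumes train_x was created as a one-column DataFrame; the original data and the exact error are not shown in the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy data standing in for the asker's DataFrame.
df = pd.DataFrame({
    "review": [
        "great movie, loved the acting",
        "terrible plot and awful acting",
        "wonderful film with a great story",
        "boring, awful and far too long",
    ],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

train_x = df[["review"]]   # a DataFrame (assumed to match the question)
train_y = df["sentiment"]

# Iterating a DataFrame yields its column labels, so the vectorizer
# treats each label as a document instead of each review.
bad = TfidfVectorizer().fit_transform(train_x)
print(bad.shape)  # one row per column label, not one per review

# Passing the column's values gives one row per review.
tfidf = TfidfVectorizer(stop_words="english")
train_x_vector = tfidf.fit_transform(train_x["review"].values)
print(train_x_vector.shape[0] == len(train_y))

# With matching row counts, the classifier can be fitted.
svc = SVC(kernel="linear")
svc.fit(train_x_vector, train_y)
prediction = svc.predict(tfidf.transform(["great story"]))
print(prediction)
```

Passing `train_x["review"]` (the Series) instead of `.values` should work as well, since iterating a Series yields its values rather than its labels.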