I want to create a text classifier that looks at research abstracts and determines whether they are focused on access to care, based on a labeled dataset I have. The data source is an Excel spreadsheet with three fields (project_number, abstract, and accessclass) and 326 rows of abstracts. The accessclass is 1 for access related and 0 for not access related (not sure if this is relevant). Anyway, I tried following along with a tutorial but wanted to make it relevant by using my own data, and I'm having some issues with my X and Y arrays. Any help is appreciated.
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score
df = pd.read_excel("accessclasses.xlsx")
df.head()
#TFIDF vectorizer
stopset = set(stopwords.words('english'))
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True,
                             strip_accents='ascii', stop_words=stopset)
y = df.accessclass
x = vectorizer.fit_transform(df)
print(x.shape)
print(y.shape)
#above and below seem to be where the issue is.
x_train, x_test, y_train, y_test = train_test_split(x, y)
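You are passing your whole dataframe to the vectorizer. Iterating over a DataFrame yields its column names, so the vectorizer treats the three names (project_number, abstract, accessclass) as three documents, and x ends up with 3 rows while y has 326. Use only the abstract column in the transformation (you could also fit the vectorizer's vocabulary on the corpus first and then call transform separately).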
Here's a solution:
y = df.accessclass
x = vectorizer.fit_transform(df.abstract)
The rest looks ok.
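To make the difference concrete, here is a minimal sketch of both the mismatch and the "fit first, transform afterwards" variant mentioned above. The four-row DataFrame is a made-up stand-in for the spreadsheet, and `stop_words="english"` (scikit-learn's built-in list) is swapped in for the NLTK stop set only to keep the example self-contained; neither is from the original post.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for accessclasses.xlsx (the real data has 326 rows).
df = pd.DataFrame({
    "project_number": [101, 102, 103, 104],
    "abstract": [
        "Improving rural access to primary care clinics",
        "Protein folding dynamics in yeast cells",
        "Barriers to care access among uninsured patients",
        "Deep learning for galaxy image classification",
    ],
    "accessclass": [1, 0, 1, 0],
})

# Passing the DataFrame itself: the vectorizer iterates over the column
# names, so the result has one row per column, not one row per abstract.
wrong = TfidfVectorizer().fit_transform(df)
print(wrong.shape[0], "rows vs", len(df), "labels")

# Split the raw text first, then fit the vocabulary on the training
# abstracts only and reuse it to transform the test abstracts.
train_text, test_text, y_train, y_test = train_test_split(
    df["abstract"], df["accessclass"],
    test_size=0.5, random_state=0, stratify=df["accessclass"],
)
vectorizer = TfidfVectorizer(lowercase=True, strip_accents="ascii",
                             stop_words="english")  # assumption: built-in list
x_train = vectorizer.fit_transform(train_text)
x_test = vectorizer.transform(test_text)

print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
```

Fitting on the training split only keeps test-set vocabulary and document frequencies out of the features; whether that matters for a dataset this size is a judgment call, and the one-line fix above works as well.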