I have been trying to perform sentiment analysis on a movie reviews dataset, and I am stuck at the point of removing English stopwords from the data. What am I doing wrong?
from nltk.corpus import stopwords
stop = stopwords.words("English")
list_ = []
for file_ in dataset:
    dataset['Content'] = dataset['Content'].apply(lambda x: [item for item in x.split(',') if item not in stop])
    list_.append(dataset)
dataset = pd.concat(list_, ignore_index=True)
I think the code should work with the information given so far. The assumption I am making is that the data has extra spaces around the tokens when they are separated by commas, so each token needs to be stripped before it is compared against the stopword list (also note that the stopword language name should be lowercase, 'english' rather than "English"). Below is the test I ran (hope it helps!):
import pandas as pd
from nltk.corpus import stopwords
stop = stopwords.words('english')
# build the test frame with both rows (DataFrame.append has been removed in newer pandas)
dataset = pd.DataFrame([{'Content': 'i, am, the, computer, machine'},
                        {'Content': 'i, play, game'}])
print(dataset)
list_ = []
for file_ in dataset:  # iterates over the column names; here there is only 'Content'
    # strip the surrounding whitespace from each token before checking it against the stopword list
    dataset['Content'] = dataset['Content'].apply(
        lambda x: [item.strip() for item in x.split(',') if item.strip() not in stop])
    list_.append(dataset)
dataset = pd.concat(list_, ignore_index=True)
print(dataset)
Input with stopwords:
Content
0 i, am, the, computer, machine
1 i, play, game
Output:
Content
0 [computer, machine]
1 [play, game]
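If the real reviews are plain sentences rather than comma-separated tokens, the same idea works with whitespace splitting instead of splitting on ','. Here is a minimal sketch under that assumption; the sample reviews and the lowercasing step are mine, and the stopword list is converted to a set so the membership checks are faster:

import pandas as pd
from nltk.corpus import stopwords
# nltk.download('stopwords')  # uncomment if the corpus has not been downloaded yet

stop = set(stopwords.words('english'))  # set membership is much faster than a list

# hypothetical reviews; the real dataset is assumed to have a 'Content' column of raw text
dataset = pd.DataFrame({'Content': ['I am enjoying the movie', 'the plot was not good']})

# lowercase, split on whitespace, and drop any token that is a stopword
dataset['Content'] = dataset['Content'].apply(
    lambda x: [word for word in x.lower().split() if word not in stop])
print(dataset)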