python pandas nltk sentiment-analysis stop-words

Unable to remove English stopwords from a dataframe


I have been trying to perform sentiment analysis on a movie reviews dataset, and I am stuck at the point where I am unable to remove English stopwords from the data. What am I doing wrong?

from nltk.corpus import stopwords
stop = stopwords.words("English")
list_ = []
for file_ in dataset:
    dataset['Content'] = dataset['Content'].apply(lambda x: [item for item in x.split(',') if item not in stop])
    list_.append(dataset)
dataset = pd.concat(list_, ignore_index=True)

Solution

  • I think the code should work, given the information so far. My assumption is that the data has an extra space after each comma, so the split items carry leading whitespace and never match an entry in the stopword list. Below is the test I ran (hope it helps!):

    import pandas as pd
    from nltk.corpus import stopwords
    
    # Requires the NLTK stopword corpus: nltk.download('stopwords')
    stop = stopwords.words('english')
    
    # DataFrame.append was removed in pandas 2.0, so build both rows at once
    dataset = pd.DataFrame([{'Content': 'i, am, the, computer, machine'},
                            {'Content': 'i, play, game'}])
    print(dataset)
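    list_ = []
    # Iterating over a DataFrame yields its column names, so this loop runs once here
    for file_ in dataset:
        # strip() removes the space left after each comma before the stopword check
        dataset['Content'] = dataset['Content'].apply(lambda x: [item.strip() for item in x.split(',') if item.strip() not in stop])
        list_.append(dataset)
    dataset = pd.concat(list_, ignore_index=True)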
    
    print(dataset)
    

    Input with stopwords:

                              Content
    0   i, am, the, computer, machine
    1                   i, play, game
    

    Output:

                   Content
    0  [computer, machine]
    1         [play, game]
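
For reference, here is a minimal sketch of why the whitespace matters and of a simpler way to apply the fix. It rests on a few assumptions: the small `stop` set is a hypothetical stand-in for `stopwords.words('english')` so the snippet runs without the NLTK corpus, and `remove_stopwords` is an illustrative helper name, not something from the question. The `lower()` call is an extra that is not in the code above; it is there because the NLTK English list is all lowercase.

```python
import pandas as pd

# Hypothetical stand-in for stopwords.words('english'), so the sketch
# runs without downloading the NLTK corpus.
stop = {"i", "am", "the"}

dataset = pd.DataFrame(
    {"Content": ["i, am, the, computer, machine", "i, play, game"]}
)

# Splitting on ',' alone keeps the space that follows each comma,
# so ' am' and ' the' are not found in the stopword set.
raw_tokens = "i, am, the".split(",")
unmatched = [token for token in raw_tokens if token not in stop]


def remove_stopwords(text):
    """Split on commas, trim whitespace, and drop stopwords."""
    tokens = (token.strip() for token in text.split(","))
    return [token for token in tokens if token and token.lower() not in stop]


# One apply over the column is enough; the loop and pd.concat are not needed.
dataset["Content"] = dataset["Content"].apply(remove_stopwords)
print(dataset)
```

With the real NLTK list, converting it once with `set(stopwords.words('english'))` should make each membership check faster than searching a list.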