I have a multi-column DataFrame with 41,000 rows of Flickr tags. I want to remove all the English stopwords from only one column, leaving the other columns intact.
This is my code for extracting the list of stopwords from nltk.corpus:
from nltk.corpus import stopwords
stopWordsListEng = stopwords.words("english")
But I want to add some additional stopwords that I can think of:
according accordingly across act actually
I haven't figured out how to add those to the existing list of stopwords.
Also, how do I apply a lambda to remove stopwords from only one column? I want my code to be as simple as possible.
Here is what my data looks like:
column1 column2 column3
some words from this column i don't know actually what across to me accordingly 25,000
I want it to look like this (more or less) after I remove all the stopwords:
column1 column2 column3
some words from this column don't know what to me 25,000
You can add additional stopwords to the existing list using list.extend, which modifies the list in place:
_new_stopwords_to_add = ['according', 'accordingly', 'across', 'act', 'actually']
stopWordsListEng.extend(_new_stopwords_to_add)
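If the list is only used for membership tests, a set may be a better fit, since it drops duplicates and makes each lookup fast. Here is a minimal sketch of both approaches. Note that base_stopwords is a short stand-in I made up for stopwords.words("english") so the snippet runs without the NLTK corpus download, and stop_list / stop_set are illustrative names:

```python
# Stand-in for stopwords.words("english"); the real NLTK list is much longer.
base_stopwords = ["i", "me", "my", "the", "a", "to"]
extra_stopwords = ["according", "accordingly", "across", "act", "actually"]

# list.extend mutates the list in place and returns None,
# so call it as a statement rather than assigning its result.
stop_list = list(base_stopwords)
stop_list.extend(extra_stopwords)

# A set removes duplicates and makes "word in stop_set" a constant-time check.
stop_set = set(base_stopwords) | set(extra_stopwords)
```

With 41,000 rows, checking each word against a set should be noticeably faster than scanning a list every time.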
To remove stopwords from only one pandas column, use pandas.Series.apply on that column:
df['column2'] = df['column2'].apply(lambda x: ' '.join([item for item in x.split() if item not in stopWordsListEng]))
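For reference, here is a small self-contained sketch of the same idea that you can run as-is. The DataFrame contents, the tiny stop_set, and the remove_stopwords helper are all made up for illustration (in real code you would build the set from the NLTK list plus your extra words). It also lowercases each word before the lookup, which is my own addition, because the NLTK stopwords are lowercase and "I" would otherwise survive:

```python
import pandas as pd

# Hypothetical stopword set for this demo only.
stop_set = {"i", "actually", "across", "accordingly"}

# Hypothetical sample data; the column split is assumed for illustration.
df = pd.DataFrame({
    "column1": ["some words from this column"],
    "column2": ["I don't know actually what across to me accordingly"],
    "column3": ["25,000"],
})

def remove_stopwords(text):
    # Compare lowercased words so "I" matches "i";
    # the words that are kept retain their original casing.
    return " ".join(word for word in text.split() if word.lower() not in stop_set)

# Only column2 is reassigned, so column1 and column3 stay untouched.
df["column2"] = df["column2"].apply(remove_stopwords)
```

Keep in mind that this assumes every value in the column is a string. If the column contains missing values, x.split() will fail on them, so you may need to fill or skip those rows first.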