Append custom stopwords to default stopwords list from nltk.corpus and remove stopwords from a series in a dataframe using lambda


I have a multi-column dataframe with 41,000 rows of Flickr tags. I want to remove all the English stopwords from only one column, leaving the other columns intact.

This is my code for extracting the list of stopwords from nltk.corpus:

from nltk.corpus import stopwords
stopWordsListEng = stopwords.words("english")

But I want to add some additional stopwords of my own:

according accordingly across act actually

I haven't figured out how to add these to the existing list of stopwords.

Also, how do I apply a lambda to remove stopwords from only one column? I want my code to be as simple as possible.

Here is what my columns look like:

column1                        column2                                                 column3
some words from this column    i don't know actually what across to me accordingly     25,000

I want my column to look like this (more or less) after I remove all the stopwords:

column1                        column2                column3
some words from this column    don't know what to me  25,000

Solution

  • You can add additional stopwords to the existing list using list.extend:

    _new_stopwords_to_add = ['according', 'accordingly', 'across', 'act', 'actually']
    stopWordsListEng.extend(_new_stopwords_to_add)
    

  • Remove stopwords from only one pandas column using pandas.Series.apply (df['column2'] is a Series, so this is the apply that gets called):

    df['column2'] = df['column2'].apply(lambda x: ' '.join([item for item in x.split() if item not in stopWordsListEng]))
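
The two steps above can also be sketched as a small helper, shown here in plain Python so it runs without NLTK or pandas installed. Everything named below is an assumption made for illustration: base_stopwords is a tiny stand-in for stopwords.words("english"), and remove_stopwords and stop_set are hypothetical names, not part of either library. The sketch also differs from the one-liner in two ways: it stores the stopwords in a set, which makes each membership test faster over many rows, and it lowercases each word before comparing, so that "I" is matched too.

```python
# Tiny stand-in for stopwords.words("english"); the real NLTK list is much longer.
base_stopwords = ["i", "the", "a", "an"]
extra_stopwords = ["according", "accordingly", "across", "act", "actually"]

# Combine both lists into a set: duplicates collapse and lookups are fast.
stop_set = set(base_stopwords) | set(extra_stopwords)


def remove_stopwords(text, stop_words):
    """Return text without the whitespace-separated words found in stop_words."""
    # Lowercase only for the comparison, so kept words retain their original casing.
    return " ".join(word for word in text.split() if word.lower() not in stop_words)


cleaned = remove_stopwords("I don't know actually what across to me accordingly", stop_set)
print(cleaned)  # don't know what to me
```

On a dataframe, the same helper could be applied to a single column with df['column2'] = df['column2'].apply(lambda x: remove_stopwords(x, stop_set)), leaving the other columns untouched. With the full NLTK list the result would be shorter than the example in the question, since that list also contains words such as "what", "to" and "me".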