Removing stopwords from a pandas series based on list


I have the following data frame called sentences:

import pandas as pd

data = [["Home of the Jacksons"], ["Is it the real thing?"], ["What is it with you?"], ["Tomatoes are the best"], ["I think it's best to path ways now"]]


sentences = pd.DataFrame(data, columns = ['sentence'])

And a dataframe called stopwords:

data = [["the"], ["it"], ["best"], [ "is"]]

stopwords = pd.DataFrame(data, columns = ['word'])

I want to remove all stopwords from sentences["sentence"]. I tried the code below but it does not work. I think there is an issue with my if statement. Can anyone help?

def remove_stopwords(input_string, stopwords_list):
    stopwords_list = list(stopwords_list)
    my_string_split = input_string.split(' ')
    my_string = []
    for word in my_string_split: 
        if word not in stopwords_list: 
            my_string.append(word)
        my_string = " ".join(my_string)
        return my_string

sentences['cut_string'] = sentences.apply(lambda row: remove_stopwords(row['sentence'], stopwords['word']), axis=1)

When I apply the function, it just returns the first word or first few words of the sentence and does not cut out stopwords at all. Kinda stuck here.


Solution

  • You can convert the stopwords['word'] column to a list and remove those words from each sentence using a list comprehension:

    stopword_list = stopwords['word'].tolist()
    
    sentences['filtered'] = sentences['sentence'].apply(lambda x: ' '.join([i for i in x.split() if i not in stopword_list]))
    

    You get

    0                 Home of Jacksons
    1                   Is real thing?
    2                   What with you?
    3                     Tomatoes are
    4    I think it's to path ways now
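
Note that "Is" survives in row 1 of the output because the membership test is case-sensitive and the stopword list only contains lowercase "is". If that is not what you want, here is a minimal sketch that compares lowercased tokens against a set. The name `remove_stopwords_ci` and the column `filtered_ci` are made up for this example, and ignoring case is an assumption about the desired behaviour, not something stated in the question.

```python
import pandas as pd

sentences = pd.DataFrame(
    {"sentence": ["Home of the Jacksons", "Is it the real thing?", "Tomatoes are the best"]}
)
stopwords = pd.DataFrame({"word": ["the", "it", "best", "is"]})

# Assumption: matching should ignore case. A set also makes each lookup O(1).
stopword_set = set(stopwords["word"].str.lower())

def remove_stopwords_ci(text, stopword_set):
    """Drop tokens whose lowercase form is a stopword; keep the casing of the rest."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stopword_set)

sentences["filtered_ci"] = sentences["sentence"].apply(
    lambda x: remove_stopwords_ci(x, stopword_set)
)
print(sentences["filtered_ci"].tolist())
# ['Home of Jacksons', 'real thing?', 'Tomatoes are']
```

Tokens with punctuation attached (for example "best.") still would not match, since the comparison is made on whole whitespace-separated tokens.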
    

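    Or you can wrap the code in a function. In the question's version, the join and return are indented inside the for loop, so the function returns after looking at only the first word; here they run after the loop finishes: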

    def remove_stopwords(input_string, stopwords_list):     
        my_string = []
        for word in input_string.split(): 
            if word not in stopwords_list: 
                my_string.append(word)
    
        return " ".join(my_string)
    
    stopword_list = stopwords['word'].tolist()
    sentences['cut_string'] = sentences['sentence'].apply(lambda row: remove_stopwords(row, stopword_list))
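
To see the difference from the function in the question, it may help to run both versions on a plain string, without pandas. In this sketch, `remove_stopwords_buggy` and `remove_stopwords_fixed` are hypothetical names: the first keeps the question's indentation, the second matches the function above.

```python
def remove_stopwords_buggy(input_string, stopwords_list):
    my_string = []
    for word in input_string.split(' '):
        if word not in stopwords_list:
            my_string.append(word)
        # Indented inside the loop: this runs on the first iteration,
        # so the function exits after looking at a single word.
        my_string = " ".join(my_string)
        return my_string

def remove_stopwords_fixed(input_string, stopwords_list):
    my_string = []
    for word in input_string.split():
        if word not in stopwords_list:
            my_string.append(word)
    # Dedented: join and return only after every word has been checked.
    return " ".join(my_string)

stopword_list = ["the", "it", "best", "is"]

print(remove_stopwords_buggy("Home of the Jacksons", stopword_list))  # Home
print(remove_stopwords_fixed("Home of the Jacksons", stopword_list))  # Home of Jacksons
```

When the first word is itself a stopword, the buggy version returns an empty string.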