python nlp nltk stop-words

Stopword segmentation


I have a problem with NLTK stopwords: when I loop over my text, the stopword check is applied to each letter rather than to each word. How can I change this behaviour?

An example:

import pandas as pd
import nltk

stopword = nltk.corpus.stopwords.words('italian')
pd.set_option('display.max_colwidth', None)

df = pd.read_csv('esempioTweet.csv', sep =',')

def remove_stop(text):
    text = [word for word in text if word not in stopword]
    return text
df['Testo_no_stop'] = df['Testo_token'].apply(lambda x: remove_stop(x))
df.head()

Given an input column like this:

[covid, calano, i, nuovi, contagi, e, tamponi]

I expect an output like this:

[covid, calano, nuovi, contagi, tamponi]

but I get an output like:

[v,d,n, ...]

I understand that the stopword check is operating on single letters and not on whole words, but why? I'm sure that my remove_stop function works correctly, so why does the stopword check behave wrongly?


Solution

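  • Your code uses for word in text. If text is a string rather than a list, that loop yields one character at a time, which is why single letters are being compared against the stopword list.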

    I simplified the code by removing pandas, which is irrelevant here, and changed your remove_stop slightly to iterate over text.split(). NLTK also provides tokenizers such as nltk.word_tokenize, which may be a better choice because they separate punctuation from words, something split() won't do.

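    import nltk
    
    # Requires the corpus to be present: run nltk.download('stopwords') once if needed.
    stopwords = nltk.corpus.stopwords.words('italian')
    
    phrase = "oggi piove e non esco"
    
    def remove_stop(text):
        # Split the string into words first; iterating over the string itself yields characters.
        return [word for word in text.split() if word not in stopwords]
    
    res = remove_stop(phrase)
    print(f"{res=}")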
    

    Output:

    res=['oggi', 'piove', 'esco']
    

    By the way, I don't think you need the lambda; just pass the function directly:

    df['Testo_no_stop'] = df['Testo_token'].apply(remove_stop)
    

    Don't forget that you can add debugging output to a function like remove_stop(). To be honest, that is a good reason to use a for loop rather than a comprehension, which is harder to debug.

    Similarly, you can print stopwords to check that it is a list. It is.
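
The points above can be pulled together in a minimal sketch. It assumes the Testo_token cells come back from read_csv as strings such as "[covid, calano, ...]" rather than as Python lists, which would explain the letter-by-letter output in the question. STOPWORDS is a tiny hypothetical stand-in for nltk.corpus.stopwords.words('italian') so the example runs without downloading the corpus. The helper to_tokens and the debug flag are illustrative names, not part of the original code.

```python
import re

# Hypothetical stand-in for nltk.corpus.stopwords.words('italian'),
# kept tiny so the sketch runs without the NLTK corpus installed.
STOPWORDS = {"i", "e", "non", "c", "o"}


def to_tokens(value):
    """Return a list of word tokens from either a list or a string."""
    if isinstance(value, list):
        return value
    # Assumption: a string cell looks like "[covid, calano, i]" or plain text.
    cleaned = value.strip().strip("[]")
    # Split on commas and whitespace, dropping empty pieces and stray quotes.
    return [part.strip("'\"") for part in re.split(r"[,\s]+", cleaned) if part]


def remove_stop(value, debug=False):
    """Drop stopwords, using an explicit loop so each decision can be printed."""
    kept = []
    for word in to_tokens(value):
        is_stop = word in STOPWORDS
        if debug:
            print(f"{word!r}: stopword={is_stop}")
        if not is_stop:
            kept.append(word)
    return kept


cell = "[covid, calano, i, nuovi, contagi, e, tamponi]"

# Iterating over the raw string yields characters, as in the question.
print(list(cell)[:6])

# Tokenizing first gives whole words to compare against the stopword set.
print(remove_stop(cell, debug=True))
```

If the column was saved as a quoted Python list (for example "['covid', 'calano']"), ast.literal_eval from the standard library would be a stricter way to parse it than the string splitting used here.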