Tags: python, pandas, for-loop, list-comprehension, pandas-apply

Why does this list comprehension only work in df.apply?


I'm trying to remove stopwords from my data, so it would go from this:

data['text'].head(5)
Out[25]: 
0    go until jurong point, crazy.. available only ...
1                        ok lar... joking wif u oni...
2    free entry in 2 a wkly comp to win fa cup fina...
3    u dun say so early hor... u c already then say...
4    nah i don't think he goes to usf, he lives aro...
Name: text, dtype: object

to this

data['newt'].head(5)
Out[26]: 
0    [go, jurong, point,, crazy.., available, bugis...
1                 [ok, lar..., joking, wif, u, oni...]
2    [free, entry, 2, wkly, comp, win, fa, cup, fin...
3    [u, dun, say, early, hor..., u, c, already, sa...
4      [nah, think, goes, usf,, lives, around, though]
Name: newt, dtype: object

I have two options for how to do this. I'm trying both options separately so one won't overwrite the other. First, I apply a function to the text column. This works; it achieves what I wanted:

def process(data):
    data = data.lower()
    data = data.split()
    data = [row for row in data if row not in stopwords]
    return data

data['newt'] = data['text'].apply(process)

The second option is without using apply. It's exactly like the function, but why does it return TypeError: unhashable type: 'list'? I checked that if row not in stopwords in the last line is what causes this, because when I delete it, it runs, but it doesn't do the stopword removal:

data['newt'] = data['text'].str.lower()
data['newt'] = data['newt'].str.split()
data['newt'] = [row for row in data['newt'] if row not in stopwords]

Solution

  • Your list comprehension fails because it checks whether each entire row is in the stopwords collection, and after str.split each row is itself a list of words. If stopwords is a set (a common choice for fast lookup), the membership test hashes the row, and lists are unhashable, hence TypeError: unhashable type: 'list'. If stopwords were a plain list, the test would simply never be true, and [row for row in data['newt'] if row not in stopwords] would just reproduce the values of the original data['newt'] column.
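
    You can reproduce the failure in isolation. A minimal sketch, assuming stopwords is a set (the values here are made up):

    stopwords = {"a", "in", "to"}          # assumed: stopwords is a set
    row = ["free", "entry", "in", "2"]     # one element of data['newt'] after str.split

    try:
        row in stopwords                   # set membership hashes its argument...
    except TypeError as e:
        print(e)                           # ...and lists are unhashable: unhashable type: 'list'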

    I think that, following your logic, your last lines for stopword removal may read

    data['newt'] = data['text'].str.lower()
    data['newt'] = data['newt'].str.split()
    data['newt'] = [[word for word in row if word not in stopwords] for row in data['newt']]
    

    If you are OK using apply, the last line can be replaced with

    data['newt'] = data['newt'].apply(lambda row: [word for word in row if word not in stopwords])
    

    Finally, you could also call

    data['newt'].apply(lambda row: " ".join(row))
    

    to get back strings at the end of the process.
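
    Putting the whole pipeline together on a toy frame (a minimal sketch; the sample row and stopword set are made up):

    import pandas as pd

    stopwords = {"i", "don't", "think", "he", "to"}   # assumed stopword set
    data = pd.DataFrame({"text": ["Nah I don't think he goes to USF"]})

    data['newt'] = data['text'].str.lower()           # lowercase
    data['newt'] = data['newt'].str.split()           # naive whitespace tokenization
    data['newt'] = [[word for word in row if word not in stopwords]
                    for row in data['newt']]          # per-row filtering
    print(data['newt'].apply(lambda row: " ".join(row))[0])   # nah goes usf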

    Mind that str.split may not be the best way to do tokenization; you may opt for a dedicated library like spaCy, combining the approach from "removing stop words using spacy" with "Add/remove custom stop words with spacy".

    To convince yourself of the above argument, try out the following code:

    import spacy
    
    sent = "She said: 'beware, your sentences may contain a lot of funny chars!'"
    
    # spacy tokenization
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(sent)
    print([token.text for token in doc])
    
    # simple split
    print(sent.split())
    

    and compare the two outputs.