python regex text nlp sentiment-analysis

Extract Both Negation & 3 Following Words (Python/DataFrame)


I'm trying to extract each negation word together with the 3 words that follow it.

For example:

"I don't want to visit again. no sympathy." (from a column called ReviewText2)

What I want: [don't want to visit, no sympathy.]

What I get: [don't, no]

I used the following code, but I don't know how to tweak it to include the following words as well.

Negative_Reviews['ReviewText2'] = Negative_Reviews['Review Text'].str.lower()

keywords = ["doesn't", "don't", "without", "won't", "not", "never", "no", "wasn't", "isn't", "can't", "shouldn't", "wouldn't", "couldn't", "nobody", "nothing", "neither", "nowhere"]

query = '|'.join(keywords)
Negative_Reviews['negation'] = Negative_Reviews['ReviewText2'].str.findall(r'\b({})\b'.format(query))

I really appreciate your help!


Solution

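  • When the pattern contains a single capturing group, findall returns only what that group matched, which is why you get just the keywords. Make the keyword group non-capturing and append an optional group for up to three following words: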

    rx = r'\b(?:{})\b(?:\s+\w+){{0,3}}'.format(query)
    Negative_Reviews['negation'] = Negative_Reviews['ReviewText2'].str.findall(rx)
    

    The regex will look like this:

    \b(?:doesn't|don't|without|won't|not|never|no|wasn't|isn't|can't|shouldn't|wouldn't|couldn't|nobody|nothing|neither|nowhere)\b(?:\s+\w+){0,3}

    Details:

    • \b - a word boundary
    • (?:doesn't|don't|without|won't|not|never|no|wasn't|isn't|can't|shouldn't|wouldn't|couldn't|nobody|nothing|neither|nowhere) - a non-capturing group matching one of the keywords
    • \b - a word boundary
    • (?:\s+\w+){0,3} - zero to three occurrences of one or more whitespace characters followed by one or more word characters.
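
As a quick check outside pandas, here is a minimal sketch using the standard `re` module, which is what `Series.str.findall` applies to each element. The helper name `find_negations` and the shortened keyword list are assumptions made for illustration, and `re.escape` is an extra precaution that is not part of the code above.

```python
import re

# Shortened keyword list, for illustration only
keywords = ["don't", "not", "never", "no", "nothing"]

# re.escape is an added precaution in case a keyword contains regex metacharacters
query = "|".join(re.escape(k) for k in keywords)

# Non-capturing keyword group, then up to three "whitespace + word" repetitions
rx = re.compile(r"\b(?:{})\b(?:\s+\w+){{0,3}}".format(query))


def find_negations(text):
    """Hypothetical helper: return each negation plus up to 3 following words."""
    return rx.findall(text.lower())


print(find_negations("I don't want to visit again. no sympathy."))
# ["don't want to visit", 'no sympathy']
```

Note that `\w+` does not match punctuation, so the period after "sympathy" is not part of the second match, and a match ends early when punctuation directly follows a word.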