I'm trying to extract each negation word together with the 3 words that follow it.
For example:
"I don't want to visit again. no sympathy." (from a column called ReviewText2)
What I want: [don't want to visit, no sympathy.]
What I get: [don't, no]
I used the following code, but I don't know how to tweak it to include the following words as well.
Negative_Reviews['ReviewText2'] = Negative_Reviews['Review Text'].str.lower()
keywords = ["doesn't","don't","without","won't","not","never","no","wasn't","isn't","can't","shouldn't","wouldn't","couldn't","nobody","nothing","neighter","nowhere"]
query = '|'.join(keywords)
Negative_Reviews['negation'] = Negative_Reviews['ReviewText2'].str.findall(r'\b({})\b'.format(query))
I really appreciate your help!
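You can make the keyword group non-capturing, so that findall returns the whole match rather than just the keyword, and append a pattern for up to three following words (the doubled braces {{0,3}} are how str.format emits a literal {0,3}):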
rx = r'\b(?:{})\b(?:\s+\w+){{0,3}}'.format(query)
Negative_Reviews['negation'] = Negative_Reviews['ReviewText2'].str.findall(rx)
The regex will look like
\b(?:doesn't|don't|without|won't|not|never|no|wasn't|isn't|can't|shouldn't|wouldn't|couldn't|nobody|nothing|neighter|nowhere)\b(?:\s+\w+){0,3}
Details:
\b - a word boundary
(?:doesn't|don't|without|won't|not|never|no|wasn't|isn't|can't|shouldn't|wouldn't|couldn't|nobody|nothing|neighter|nowhere) - one of the keywords
\b - a word boundary
(?:\s+\w+){0,3} - zero to three occurrences of one or more whitespaces followed by one or more word chars.
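To check the pattern without a DataFrame, here is a minimal sketch using only the standard re module. pandas documents Series.str.findall as equivalent to applying re.findall to every element, so the behaviour should carry over. The helper name extract_negations and the re.escape call are my own additions, not part of the original code.

```python
import re

# Keyword list copied verbatim from the question.
keywords = ["doesn't", "don't", "without", "won't", "not", "never", "no",
            "wasn't", "isn't", "can't", "shouldn't", "wouldn't", "couldn't",
            "nobody", "nothing", "neighter", "nowhere"]

# re.escape is a precaution (an assumption, not in the original) in case a
# keyword ever contains a regex metacharacter.
query = "|".join(re.escape(k) for k in keywords)

# {{0,3}} becomes a literal {0,3} after str.format.
rx = re.compile(r"\b(?:{})\b(?:\s+\w+){{0,3}}".format(query))


def extract_negations(text):
    """Hypothetical helper: each negation keyword plus up to three following words."""
    return rx.findall(text.lower())


sample = "I don't want to visit again. no sympathy."
print(extract_negations(sample))
# prints ["don't want to visit", 'no sympathy']
```

Note that the trailing period after "sympathy" is not captured, because \w matches only letters, digits and underscores. For the same reason, a following word that contains an apostrophe would end the match early.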