I am having some trouble with another regular expression. For this one, my code is supposed to look for the pattern:
re.compile(r"kill(?:ed|ing|s)\D*(\d+).*?(?:men|women|children|people)?")
However, it is matching too aggressively. It starts at a form of 'kill' in one sentence, but then keeps consuming text across sentence boundaries until it reaches a digit further down. In particular, it is matching:
killed in an apparent u.s. drone attack on a car in yemen on sunday, tribal sources and local officials said.the men's car was driving through the south-eastern province of maareb, a mostly desert region where militants have taken refuge after being driven from southern strongholds.yemen, where al qaeda militants exploited a security vacuum during last year's uprising that ousted president ali abdullah saleh, has seen an in10
This is not the behavior I'm after. I would like this pattern to fail if it cannot be found inside a single sentence.
The solution I'm trying to implement, in pseudocode, is:
find instance of 'kill'
if what follows contains a period (\.) before a digit, do not match.
My failed implementation looks like this:
re.compile(r"kill(?:ed|ing|s)\D*(?!:\..*?)(\d+).*?(?:men|women|children|people)?")
I've tried a 'look-behind', but I have to specify a width. What I'm trying to do with the above is match any ending of 'kill', followed by any non-digit, but NOT match a period, and anything else is free to follow before the digit I'm after.
Sadly, this code behaves exactly the same in my test. Any help would be appreciated.
A small modification:
r"kill(?:ed|ing|s)[^\d.]*(\d+)[^.]*?(?:men|women|children|people)?"
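Basically, I prevent a full stop (.) from being matched anywhere between kill and the men/women/etc. that follows. Inside a character class the dot is literal, so [^\d.] means "any character that is neither a digit nor a period", and [^.] means "any character except a period".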
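A quick way to check the difference is a minimal sketch like the one below. The two sample sentences and the helper name find_kill_counts are made up for illustration and are not taken from your data; the two patterns are the original one from the question and the modified one above.

```python
import re

# Original pattern from the question: \D* is free to run across periods.
original = re.compile(
    r"kill(?:ed|ing|s)\D*(\d+).*?(?:men|women|children|people)?"
)

# Modified pattern: neither gap is allowed to contain a period.
modified = re.compile(
    r"kill(?:ed|ing|s)[^\d.]*(\d+)[^.]*?(?:men|women|children|people)?"
)


def find_kill_counts(text):
    """Return the digit strings captured by the modified pattern (hypothetical helper)."""
    return modified.findall(text)


# Hypothetical test sentences, lower-cased like the text in the question.
same_sentence = "a bomb killed at least 12 people in the city."
crosses_period = (
    "two men were killed in a drone attack. officials reported 10 arrests."
)

print(find_kill_counts(same_sentence))    # digit in the same sentence
print(find_kill_counts(crosses_period))   # digit only after a period
print(original.findall(crosses_period))   # original pattern, for comparison
```

One caveat: this treats every period as a sentence boundary, so an abbreviation such as "u.s." between kill and the number will also block the match. If that matters for your text, you would need a smarter notion of sentence end than a bare period.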