Tags: python, regex, negative-lookbehind

Python regex: match so long as a given character does not appear


I am having some trouble with another regular expression. For this one, my code is supposed to look for the pattern:

re.compile(r"kill(?:ed|ing|s)\D*(\d+).*?(?:men|women|children|people)?")

However, it is matching too aggressively. It happens to match a sentence which has the word 'killing' in it. But the pattern continues to collect until it reaches a digit further down in the text. In particular, it is matching:

killed in an apparent u.s. drone attack on a car in yemen on sunday, tribal sources and local officials said.the men's car was driving through the south-eastern province of maareb, a mostly desert region where militants have taken refuge after being driven from southern strongholds.yemen, where al qaeda militants exploited a security vacuum during last year's uprising that ousted president ali abdullah saleh, has seen an in10

This is not the behavior I'm after. I would like this pattern to fail if it cannot be found inside a single sentence.
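The over-matching can be reproduced on a much shorter text. The sketch below uses a made-up sample (not the original article) in which the only digit sits in the sentence after 'killed':

```python
import re

# Original pattern from the question.
pattern = re.compile(r"kill(?:ed|ing|s)\D*(\d+).*?(?:men|women|children|people)?")

# Hypothetical, shortened sample: the number belongs to the *next* sentence.
sample = ("two militants were killed in a drone attack, officials said. "
          "the blast wounded 10 people.")

match = pattern.search(sample)

# \D* accepts any non-digit, including the period, so the match runs across
# the sentence boundary and captures the unrelated number.
print(match.group(1))  # prints 10
```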

The solution I'm trying to implement, in pseudocode, is:

find instance of 'kill'
if what follows contains a period (\.) before a digit, do not match.

My failed implementation looks like this:

re.compile(r"kill(?:ed|ing|s)\D*(?!:\..*?)(\d+).*?(?:men|women|children|people)?")

I've tried a 'look-behind', but I have to specify a width. What I'm trying to do with the above is match any ending of 'kill', followed by any non-digit, but NOT match a period, and anything else is free to follow before the digit I'm after.

Sadly, this code behaves exactly the same way in my test. Any help would be appreciated.


Solution

  • A small modification:

    r"kill(?:ed|ing|s)[^\d.]*(\d+)[^.]*?(?:men|women|children|people)?"
    

    Basically, I prevent a full stop (.) from being matched anywhere between kill and the men/women/etc. that follows: \D* becomes [^\d.]* and .*? becomes [^.]*?.
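A quick way to check the modified pattern is to run it on one text where the number is in the same sentence as 'killed' and one where it is not. This is only a sketch: the helper name extract_count and both sample strings are made up for illustration.

```python
import re

# Pattern from the answer: neither character class may consume a period.
fixed = re.compile(r"kill(?:ed|ing|s)[^\d.]*(\d+)[^.]*?(?:men|women|children|people)?")

def extract_count(text):
    """Return the captured number as a string, or None if there is no match."""
    m = fixed.search(text)
    return m.group(1) if m else None

# Hypothetical samples.
same_sentence = "the blast killed at least 12 people in the market."
next_sentence = ("two militants were killed in a drone attack, officials said. "
                 "the blast wounded 10 people.")

print(extract_count(same_sentence))  # prints 12
print(extract_count(next_sentence))  # prints None
```

Two caveats, as far as I can tell: any period stops the search, so an abbreviation such as "u.s." between 'killed' and the number also prevents a match; and because the trailing [^.]*? is lazy and the final group is optional, that tail normally matches nothing, so the pattern does not actually require men/women/children/people to be present.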