python regex testing match quantifiers

Regex: First word of sentence (following another sentence w/ unknown punctuation)


I need a regex that will find the word "When" in all of these sentences and in any similar variation.

  • "This is that." When did it happen? (ending in quotes/or FN call)
  • This is that. When did it happen? (note quotes are gone)
  • This is that.  When did it happen? (notice the double space)
  • This is that. when did it happen? (notice the lowercase w)
  • This is that? When did it happen? (notice the question mark)

This code will match on the first iteration: (?<=\.\".)[a-zA-Z]*?(?=\s)

I'm mostly confused by the fact that my testing programs don't seem to let me use quantifiers or other modifiers within the look-back text. For example, I could do something like:

(?<=((\.)|(\!)|(\?))\"{0,1}\s{1,2})[a-zA-Z]*?(?=\s)

My problems with that text are:

1) It simply doesn't seem to process.

2) It doesn't seem like there is any easy way to make the quantifiers within the look-back lazy. In other words, even if it was processing, I'm not sure how it would make sense of (?<=((\.)|(\!)|(\?))\"{0,1}\s{1,2}?)[a-zA-Z]*?(?=\s)

3) I added the excessive parentheses because I find the pattern easier to read that way, but I'm not getting results w/ or w/o them, so they aren't the issue. As an aside, would they be an issue?
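
For reference, the failure described in (1) can be reproduced directly in Python. This is a minimal sketch, assuming Python 3 and the standard re module; the variable names are only illustrative. It compiles the pattern from the question, records the error, and then shows that a lookbehind with a fixed width (one punctuation mark plus exactly one space) compiles without complaint.

```python
import re

# Pattern from the question: the lookbehind contains {0,1} and {1,2},
# so the text it has to match does not have a single fixed width.
variable_width = r'(?<=((\.)|(\!)|(\?))\"{0,1}\s{1,2})[a-zA-Z]*?(?=\s)'

try:
    re.compile(variable_width)
    error_message = None
except re.error as exc:
    # The re module rejects the pattern at compile time.
    error_message = str(exc)

print(error_message)

# A lookbehind of fixed width (punctuation + exactly one space) is accepted,
# but it only covers the single-space, no-quote variants.
fixed_width = re.compile(r'(?<=[.!?] )[a-zA-Z]+(?=\s)')
first_words = fixed_width.findall('This is that. When did it happen?')
print(first_words)
```

So the quantifiers are the problem, not the extra parentheses, and making them lazy would not help: lazy or greedy, the lookbehind still has no fixed width.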


Solution

  • Since Python's re module doesn't support variable-length lookbehind, you can instead match the preceding punctuation as part of the pattern and capture only the word you want in a group.

    (?:[.!)?])\"?\s{1,2}([a-zA-Z]+)(?=\s)
    

    DEMO

    >>> import re
    >>> s = '''"This is that." When did it happen? (ending in quotes/or FN call)
    This is that. When did it happen? (note quotes are gone)
    This is that.  When did it happen? (notice the double space)
    This is that. when did it happen? (notice the lowercase w)
    This is that? When did it happen? (notice the question mark)'''
    >>> re.findall(r'(?:[.!)?])\"? {1,2}([a-zA-Z]+)(?=\s)', s)
    ['When', 'When', 'When', 'when', 'When']
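
One thing worth noting: the pattern as first given uses \s{1,2}, while the demo uses a literal space. On multi-line input the two behave differently, because \s also matches a newline and the character class includes ")". The sketch below compares them on the same sample text. The third pattern, which drops ")" from the character class, is my own assumption about what is needed for these inputs, not part of the answer above.

```python
import re

text = '''"This is that." When did it happen? (ending in quotes/or FN call)
This is that. When did it happen? (note quotes are gone)
This is that.  When did it happen? (notice the double space)
This is that. when did it happen? (notice the lowercase w)
This is that? When did it happen? (notice the question mark)'''

# Literal space, as in the demo: a newline cannot separate the two sentences.
with_space = re.findall(r'(?:[.!)?])\"? {1,2}([a-zA-Z]+)(?=\s)', text)

# \s, as in the pattern first given: ")" followed by a newline also counts,
# so the first word of each following line is captured as well.
with_whitespace = re.findall(r'(?:[.!)?])\"?\s{1,2}([a-zA-Z]+)(?=\s)', text)

# Assumed variant: only ., ! and ? end a sentence, so \s is safe again here.
without_paren = re.findall(r'[.!?]\"?\s{1,2}([a-zA-Z]+)(?=\s)', text)

print(with_space)
print(with_whitespace)
print(without_paren)
```

If sentences can be separated by a line break in your real data, the \s form is the one you want, and it is then the ")" in the character class that you need to decide about.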