Python regex selection of verbs with present perfect


In a given string, I am trying to catch verbs that are in the present perfect tense. I do that with the following regular expression in Python:

import re
sentence = "The Batman has never shown his true identity but has done so much good for Gotham City"

verb = re.findall(r'has\s[^\,\.\"]{0,50}done', sentence)

And the outcome is:

>>> print(verb)

['has never shown his true identity but has done']

Here, the correct answer would have been 'has done', but the match starts at the wrong 'has', the one from 'has never shown'. The part [^\,\.\"]{0,50} allows some freedom in what sits between 'has' and 'done'; that is not needed in this example but is useful on my real data. However, the regex catches the first 'has' it finds, which is not always the right one. Is it possible to take the last 'has' instead?


Solution

  • You can use a tempered greedy token solution here:

    \bhas\s(?:(?!\bhas\b)[^,."]){0,50}?\bdone\b
    


    Details

    • \bhas - a whole word has
    • \s - one whitespace char
    • (?:(?!\bhas\b)[^,."]){0,50}? - any character other than a comma, a period or a double quote, matched zero to fifty times but as few times as possible, where no matched character may begin the whole word has
    • \bdone\b - a whole word done.
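It may help to see why the lookahead is needed and why a lazy quantifier alone is not enough. The sketch below reuses the sentence from the question; the variable names lazy_only and tempered are only illustrative. A regex match is always reported from the leftmost position where it can succeed, so without the lookahead the match still begins at the first has.

```python
import re

sentence = "The Batman has never shown his true identity but has done so much good for Gotham City"

# Lazy quantifier only: the engine still starts at the first "has"
# and simply extends until it reaches "done".
lazy_only = re.findall(r'\bhas\s[^,."]{0,50}?\bdone\b', sentence)

# Lookahead added: the span between "has" and "done" may not run
# across another whole word "has", so the first attempt fails.
tempered = re.findall(r'\bhas\s(?:(?!\bhas\b)[^,."]){0,50}?\bdone\b', sentence)

print(lazy_only)  # ['has never shown his true identity but has done']
print(tempered)   # ['has done']
```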

    See a Python demo:

    import re
    sentence = "The Batman has never shown his true identity but has done so much good for Gotham City"
    verb = re.findall(r'\bhas\s(?:(?!\bhas\b)[^,."]){0,50}?\bdone\b', sentence)
    print(verb)
    # => ['has done']
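
The question mentions that the real data has words between has and done. As a rough check of that case, here is a small sketch using made-up sentences; the helper name find_has_done and the test sentences are assumptions for illustration, not part of the original data.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
PATTERN = re.compile(r'\bhas\s(?:(?!\bhas\b)[^,."]){0,50}?\bdone\b')

def find_has_done(text):
    """Return each 'has ... done' span that does not cross another whole word 'has'."""
    return PATTERN.findall(text)

# Words between "has" and "done" are allowed.
print(find_has_done("She has never shown fear but has already done the work"))
# => ['has already done']

# "has" inside a longer word ("chased") does not block or start a match.
print(find_has_done("It has chased him but has not yet done harm"))
# => ['has not yet done']

# A comma right after "has" prevents a match, because \s must follow "has".
print(find_has_done("He has, as far as I know, done nothing"))
# => []
```

Note that the {0,50} limit still applies: if more than fifty characters separate the nearest has from done, nothing is returned for that span.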