Tags: python, regex, text, french

regex make group appear only once


I am trying to run a regex query in Python and I have the following problem:

In French, the subject of a sentence can appear before or after the verb. For example, the sentence "she says" can be translated as "elle dit" or "dit-elle", where "elle" is "she" and "dit" is "says".

Is it possible to capture only sentences containing "elle" and "dit", whether the subject "elle" comes before or after the verb "dit"? I have started with the following:

(elle).{0,10}(dit).{0,10}(elle)

But now I would like to make one of the (elle) groups optional when the other has been found. The * and + operators do not help in this case.


Solution

  • You can use the PyPI regex module, which can be installed with pip install regex (or pip3 install regex):

    import regex
    # Unlike re, regex allows variable-width lookbehinds and reuse of the group name "subject"
    p = r'(?<=\b(?P<subject>il|elle)\b.{0,10})?\b(?P<predicate>dit|mange)\b(?=.{0,10}\b(?P<subject>il|elle)\b)?'
    print([x.groupdict() for x in regex.finditer(p, 'elle dit et dit-elle et il mange ... dit-il', regex.S)])
    


    The pattern may be created dynamically from variables:

    subjects = ['il', 'elle']
    predicates = ['dit', 'mange']
    # Inside an f-string, the quantifier braces must be doubled ({{0,10}}) to stay literal
    p = fr'(?<=\b(?P<subject>{"|".join(subjects)})\b.{{0,10}})?\b(?P<predicate>{"|".join(predicates)})\b(?=.{{0,10}}\b(?P<subject>{"|".join(subjects)})\b)?'
    

    Details

    • (?<=\b(?P<subject>il|elle)\b.{0,10})? - an optional lookbehind that captures a whole word il or elle ending 0 to 10 characters before the predicate
    • \b(?P<predicate>dit|mange)\b - a whole word dit or mange
    • (?=.{0,10}\b(?P<subject>il|elle)\b)? - an optional lookahead that captures a whole word il or elle starting 0 to 10 characters after the predicate.
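For comparison, here is a minimal sketch of the same idea using only the standard-library re module, in case installing regex is not an option. It is a different technique from the lookaround pattern above: re rejects variable-width lookbehinds and repeated group names, so this version writes the two word orders as an alternation with separately named groups and merges them afterwards. The helper find_pairs and the group names subject_before, subject_after, predicate_a and predicate_b are hypothetical, chosen for illustration only. The gap is matched lazily so the nearest partner word is picked.

```python
import re

subjects = ['il', 'elle']
predicates = ['dit', 'mange']

# Escape each word so it is matched literally, then join into alternations.
subj = '|'.join(map(re.escape, subjects))
pred = '|'.join(map(re.escape, predicates))

# Branch 1: subject, up to 10 characters, predicate.
# Branch 2: predicate, up to 10 characters, subject.
# re does not allow a group name to be reused, so each branch has its own names.
# The quantifier braces are doubled because these are f-strings.
pattern = re.compile(
    rf'\b(?P<subject_before>{subj})\b.{{0,10}}?\b(?P<predicate_a>{pred})\b'
    rf'|\b(?P<predicate_b>{pred})\b.{{0,10}}?\b(?P<subject_after>{subj})\b',
    re.S,
)

def find_pairs(text):
    """Hypothetical helper: return one {'subject', 'predicate'} dict per match."""
    pairs = []
    for m in pattern.finditer(text):
        pairs.append({
            # Only one branch matched, so the other branch's groups are None.
            'subject': m.group('subject_before') or m.group('subject_after'),
            'predicate': m.group('predicate_a') or m.group('predicate_b'),
        })
    return pairs

print(find_pairs('elle dit et dit-elle et il mange ... dit-il'))
```

Unlike the lookaround version, this pattern consumes the subject as part of the match, so a subject cannot be shared between two neighbouring predicates; results may therefore differ on text where matches would overlap.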