I am trying to run a regex query in Python and I have the following problem:
In French, the subject of a sentence can appear before or after the verb. For example, "she says" can be translated as "elle dit" or "dit-elle", where "elle" is "she" and "dit" is "says".
Is it possible to capture only sentences containing "elle" and "dit", whether the subject "elle" comes before or after the verb "dit"? I have started with the following:
(elle).{0,10}(dit).{0,10}(elle)
But now I would like to make one of the (elle) groups optional when the other has been found. The * and + operators do not help in this case.
You can use the PyPI regex module, which can be installed using pip install regex (or pip3 install regex):
import regex
# Unlike re, the regex module accepts variable-width lookbehinds and reuse of the same group name
p = r'(?<=\b(?P<subject>il|elle)\b.{0,10})?\b(?P<predicate>dit|mange)\b(?=.{0,10}\b(?P<subject>il|elle)\b)?'
print([x.groupdict() for x in regex.finditer(p, 'elle dit et dit-elle et il mange ... dit-il', regex.S)])
See the online Python demo
The pattern may be created dynamically from variables:
subjects = ['il', 'elle']
predicates = ['dit', 'mange']
# Quantifier braces are doubled ({{0,10}}) so the f-string emits a literal {0,10}
p = fr'(?<=\b(?P<subject>{"|".join(subjects)})\b.{{0,10}})?\b(?P<predicate>{"|".join(predicates)})\b(?=.{{0,10}}\b(?P<subject>{"|".join(subjects)})\b)?'
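When the pattern is assembled with an f-string, two details are easy to miss: literal quantifier braces have to be doubled, and words that might contain regex metacharacters are safer when passed through re.escape. Here is a minimal sketch of that idea. The helper name build_pattern and its max_gap parameter are hypothetical, introduced only for illustration, and the returned string is meant to be passed to the regex module as above.

```python
import re


def build_pattern(subjects, predicates, max_gap=10):
    """Return a pattern string for the third-party regex module (hypothetical helper)."""
    # Escape each word so that characters such as "." or "?" are matched literally.
    subj = "|".join(re.escape(s) for s in subjects)
    pred = "|".join(re.escape(p) for p in predicates)
    # Doubled braces produce a literal quantifier, e.g. ".{0,10}".
    gap = f".{{0,{max_gap}}}"
    return (
        fr"(?<=\b(?P<subject>{subj})\b{gap})?"
        fr"\b(?P<predicate>{pred})\b"
        fr"(?={gap}\b(?P<subject>{subj})\b)?"
    )


p = build_pattern(["il", "elle"], ["dit", "mange"])
print(p)
```

The printed string can be compared with the hand-written pattern to confirm that the quantifiers survived formatting.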
Details
(?<=\b(?P<subject>il|elle)\b.{0,10})? - an optional lookbehind that grabs a whole word il or elle within 0 to 10 chars before the predicate
\b(?P<predicate>dit|mange)\b - a whole word dit or mange
(?=.{0,10}\b(?P<subject>il|elle)\b)? - an optional lookahead that grabs a whole word il or elle within 0 to 10 chars after the predicate.
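If installing a third-party module is not an option, a similar result can be approximated with the standard re module by spelling out both word orders as an alternation. This is a different technique from the lookaround pattern above, not a drop-in equivalent: it consumes the text it matches, so each word takes part in at most one pair, and a match always requires a subject. The names PAIR, find_pairs and the four group names are assumptions made for this sketch.

```python
import re

SUBJECTS = r"il|elle"
PREDICATES = r"dit|mange"

# Two alternatives: subject ... predicate, or predicate ... subject,
# with at most 10 characters in between.
# re does not allow a group name to be reused, hence four distinct names.
PAIR = re.compile(
    rf"\b(?P<subj_before>{SUBJECTS})\b.{{0,10}}?\b(?P<pred_a>{PREDICATES})\b"
    rf"|\b(?P<pred_b>{PREDICATES})\b.{{0,10}}?\b(?P<subj_after>{SUBJECTS})\b",
    re.DOTALL,
)


def find_pairs(text):
    """Return (subject, predicate) tuples in order of appearance."""
    pairs = []
    for m in PAIR.finditer(text):
        # Exactly one alternative matched, so one group of each kind is None.
        subject = m.group("subj_before") or m.group("subj_after")
        predicate = m.group("pred_a") or m.group("pred_b")
        pairs.append((subject, predicate))
    return pairs


print(find_pairs("elle dit et dit-elle et il mange ... dit-il"))
# [('elle', 'dit'), ('elle', 'dit'), ('il', 'mange'), ('il', 'dit')]
```

To keep only the sentences that contain a pair, the same compiled pattern can be used with PAIR.search(sentence) as a filter.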