In my sample, all ages are between 10 and 99. I want to find all instances of the variable "age" being > or < exactly two digits. I need to know the inequality sign and the two digits. I do not want the inequality sign and the two digits if they correspond to a different variable (e.g., height or weight). For simplicity, all ages, heights, and weights are exactly two digits. There are no units.
sample_text = "age > 10 but can be > 20 or > 22 - if the height is > 60 then age can be > 30, otherwise it must be < 35"
The output I am seeking is an age_list that looks like [(">", "10"), (">", "20"), (">", "22"), (">", "30"), ("<", "35")]. The list should be able to be of any length.
It is easy enough to get them all when the format is always "age" followed by sign followed by digits. I used the code below and it pulls out [(">", "10"), (">", "30")], but I can't get the other digits - e.g., the 20 and 22 that are clearly tied to age. I need to get those but avoid the 60 tied to height (and any digit tied to weight if there is weight).
re.findall(r"age\s*[a-zA-Z\s]*(>|<)\s*(\d\d)", sample_text)
I have workarounds using re.search when the format is "age sign digit or something something sign digit", but the workarounds fail if there are a bunch of signs and digits that don't have "age" before them - e.g., "the age must be > 20 or if this then > 24 or also > 30 if that..."
You can first use a tempered match that starts at age and does not cross either height or weight:
\bage(?:(?:(?!\b(?:height|weight)\b)[^<>])*[<>]\s+\d+)+
The pattern matches:
\bage
    Match age preceded by a word boundary
(?:
    Outer non-capture group to repeat as a whole
(?:
    Inner non-capture group to repeat as a whole
(?!\b(?:height|weight)\b)
    Negative lookahead: assert that neither height nor weight is directly to the right, using word boundaries to prevent a partial match
[^<>]
    Match any char except < or >
)*
    Close the inner non-capture group and optionally repeat it
[<>]\s+\d+
    Match either < or >, then 1+ whitespace chars and 1+ digits
)+
    Close the outer group and repeat it 1+ times

Then process the matches with re.findall and ([<>])\s+(\d+), using 2 capture groups, capturing the sign in group 1 and the digits in group 2.
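import re

# Step 1: match each span that starts at "age" and stops before height/weight
pattern = r"\bage(?:(?:(?!\b(?:height|weight)\b)[^<>])*[<>]\s+\d+)+"
s = "age > 10 but can be > 20 or > 22 - if the height is > 60 then age can be > 30, otherwise it must be < 35\n"

for m in re.findall(pattern, s):
    # Step 2: pull the (sign, digits) pairs out of each age span
    print(re.findall(r"([<>])\s+(\d+)", m))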
Output
[('>', '10'), ('>', '20'), ('>', '22')]
[('>', '30'), ('<', '35')]
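The loop prints one list per age span, while the question asks for a single age_list. One way to get that, as a minimal sketch, is to extend one list with the pairs from every span. The function name extract_age_comparisons and the constant names are made up for this example.

```python
import re

# Two-step approach from above, wrapped in a helper (names are illustrative).
AGE_SPAN = re.compile(r"\bage(?:(?:(?!\b(?:height|weight)\b)[^<>])*[<>]\s+\d+)+")
SIGN_AND_DIGITS = re.compile(r"([<>])\s+(\d+)")


def extract_age_comparisons(text):
    """Return every (sign, digits) pair tied to age as one flat list."""
    age_list = []
    for span in AGE_SPAN.findall(text):
        # Each span holds only comparisons that belong to age.
        age_list.extend(SIGN_AND_DIGITS.findall(span))
    return age_list


sample_text = "age > 10 but can be > 20 or > 22 - if the height is > 60 then age can be > 30, otherwise it must be < 35"
age_list = extract_age_comparisons(sample_text)
print(age_list)
# [('>', '10'), ('>', '20'), ('>', '22'), ('>', '30'), ('<', '35')]

# The second example from the question, where "age" appears only once.
other_list = extract_age_comparisons(
    "the age must be > 20 or if this then > 24 or also > 30 if that..."
)
print(other_list)
# [('>', '20'), ('>', '24'), ('>', '30')]
```

Note that, like the pattern above, this requires at least one whitespace character between the sign and the digits, and it accepts any number of digits rather than exactly two.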