How to find all instances of two digits after ">" or "<" for a particular variable but not for other variables

In my sample, all ages are between 10 and 99. I want to find every place where the variable "age" is compared with > or < to a two-digit number. For each one I need the comparison sign and the two digits. I do not want the sign and the digits if they belong to a different variable (e.g., height or weight). For simplicity, all ages, heights, and weights are exactly two digits. There are no units.

sample_text = "age > 10 but can be > 20 or > 22 - if the height is > 60 then age can be > 30, otherwise it must be < 35"

The output I am seeking is an age_list that looks like [(">", "10"), (">", "20"), (">", "22"), (">", "30"), ("<", "35")]. The list should be able to be of any length.

It is easy enough to get them all when the format is always "age" followed by sign followed by digits. I used the code below and it pulls out [(">", "10"), (">", "30")], but I can't get the other digits - e.g., the 20 and 22 that are clearly tied to age. I need to get those but avoid the 60 tied to height (and any digit tied to weight if there is weight).

re.findall(r"age\s*[a-zA-Z\s]*(>|<)\s*(\d\d)", sample_text)

I have workarounds using re.search when the format is "age sign digit or something something sign digit", but the workarounds fail if there are several signs and digits that don't have "age" directly before them - e.g., "the age must be > 20 or if this then > 24 or also > 30 if that..."

Solution

You can first temper the match so that it starts at age and does not cross the word height or weight:

\bage(?:(?:(?!\b(?:height|weight)\b)[^<>])*[<>]\s+\d+)+

The pattern matches:

\bage Match age preceded by a word boundary
(?: Outer non capture group to repeat as a whole
- (?: Inner non capture group to repeat as a whole
  - (?!\b(?:height|weight)\b) Negative lookahead, assert not height or weight directly to the right and use a word boundary to prevent a partial match
  - [^<>] Match any char except < or >
- )* Close inner non capture group and optionally repeat
- [<>]\s+\d+ Match either < or > then 1+ whitespace chars and 1+ digits
)+ Close outer group and repeat 1+ times
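
As a sketch of the same breakdown in code, the pattern can be written with re.VERBOSE so each part carries its own comment. The names AGE_RUN, sample and runs are only illustrative and not part of the original answer; the pattern itself is unchanged.

```python
import re

# Same pattern as above, spread out with re.VERBOSE (whitespace and
# comments inside the pattern are ignored by the regex engine).
AGE_RUN = re.compile(
    r"""
    \bage                               # the word age, preceded by a word boundary
    (?:                                 # outer group: filler text plus one comparison
        (?:                             # inner group: filler, one character at a time
            (?!\b(?:height|weight)\b)   # stop before the word height or weight
            [^<>]                       # any character except < or >
        )*
        [<>]\s+\d+                      # sign, 1+ whitespace chars, 1+ digits
    )+                                  # repeat for every comparison tied to this age
    """,
    re.VERBOSE,
)

sample = ("age > 10 but can be > 20 or > 22 - if the height is > 60 "
          "then age can be > 30, otherwise it must be < 35")

# Each match is one run of text that starts at age and ends at its last comparison.
runs = [m.group(0) for m in AGE_RUN.finditer(sample)]
print(runs)
```

The "height is > 60" comparison is not part of any run, because the negative lookahead stops the filler before the word height and no sign follows at that position.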


Then process the matches with re.findall and ([<>])\s+(\d+) using 2 capture groups, capturing the sign in group 1 and the digits in group 2

import re

pattern = r"\bage(?:(?:(?!\b(?:height|weight)\b)[^<>])*[<>]\s+\d+)+"
s = ("age > 10 but can be > 20 or > 22 - if the height is > 60 then age can be > 30, otherwise it must be < 35\n")

for m in re.findall(pattern, s):
    print(re.findall(r"([<>])\s+(\d+)", m))

Output

[('>', '10'), ('>', '20'), ('>', '22')]
[('>', '30'), ('<', '35')]
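
The question asks for a single age_list rather than one list per match. A minimal sketch of one way to get that, assuming the same pattern and a hypothetical helper name age_comparisons, is to extend one list with the pairs from every match:

```python
import re

PATTERN = r"\bage(?:(?:(?!\b(?:height|weight)\b)[^<>])*[<>]\s+\d+)+"

def age_comparisons(text):
    """Return one flat list of (sign, digits) pairs tied to age.

    Hypothetical helper for illustration; it reuses the pattern above.
    """
    pairs = []
    # The pattern has no capture groups, so findall returns the full match text.
    for run in re.findall(PATTERN, text):
        pairs.extend(re.findall(r"([<>])\s+(\d+)", run))
    return pairs

s = ("age > 10 but can be > 20 or > 22 - if the height is > 60 "
     "then age can be > 30, otherwise it must be < 35")
age_list = age_comparisons(s)
print(age_list)
```

For the sample text this gives the five pairs requested in the question. Note that the pattern expects at least one whitespace character between the sign and the digits, so input such as "age >10" would need \s* instead of \s+.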