python regex regex-group

How to find all instances of two digits after ">" or "<" for a particular variable but not for other variables


In my sample, all ages are between 10 and 99. I want to find every place where the variable "age" is compared, with > or <, to a two-digit number. I need both the comparison sign and the two digits. I do not want the sign and the digits if they belong to a different variable (e.g., height or weight). For simplicity, all ages, heights, and weights are exactly two digits. There are no units.

sample_text = "age > 10 but can be > 20 or > 22 - if the height is > 60 then age can be > 30, otherwise it must be < 35"

The output I am seeking is an age_list that looks like [(">", "10"), (">", "20"), (">", "22"), (">", "30"), ("<", "35")]. The list should be able to be of any length.

It is easy enough to get them all when the format is always "age" followed by sign followed by digits. I used the code below and it pulls out [(">", "10"), (">", "30")], but I can't get the other digits - e.g., the 20 and 22 that are clearly tied to age. I need to get those but avoid the 60 tied to height (and any digit tied to weight if there is weight).

re.findall(r"age\s*[a-zA-Z\s]*(>|<)\s*(\d\d)", sample_text)

I have workarounds using re.search when the format is "age sign digit or something something sign digit", but the workarounds fail if there are a bunch of signs and digits that don't have "age" before them, e.g., "the age must be > 20 or if this then > 24 or also > 30 if that..."


Solution

  • You can first temper the match so that it starts at age and does not cross either height or weight:

    \bage(?:(?:(?!\b(?:height|weight)\b)[^<>])*[<>]\s+\d+)+
    

    The pattern matches:

    • \bage Match age preceded by a word boundary
    • (?: Outer non capture group to repeat as a whole
      • (?: Inner non capture group to repeat as a whole
        • (?!\b(?:height|weight)\b) Negative lookahead, assert not height or weight directly to the right and use a word boundary to prevent a partial match
        • [^<>] Match any char except < or >
      • )* Close inner non capture group and optionally repeat
      • [<>]\s+\d+ Match either < or > then 1+ whitespace chars and 1+ digits
    • )+ Close outer group and repeat 1+ times
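
The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows inline comments. This is only a sketch of the same regex in a more readable layout; the names age_span and sample_text are chosen here for illustration.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
age_span = re.compile(
    r"""
    \bage                              # the word age, preceded by a word boundary
    (?:                                # outer group: one comparison, repeated
        (?:                            # inner group: consume one filler char at a time
            (?!\b(?:height|weight)\b)  # stop if height or weight starts here
            [^<>]                      # any char except the two signs
        )*
        [<>]\s+\d+                     # sign, 1+ whitespace chars, 1+ digits
    )+
    """,
    re.VERBOSE,
)

sample_text = (
    "age > 10 but can be > 20 or > 22 - if the height is > 60 "
    "then age can be > 30, otherwise it must be < 35"
)

# No capture groups, so findall returns the full text of each match.
spans = age_span.findall(sample_text)
print(spans)
```

Each element of spans is one stretch of text that starts at age and stops before height or weight, so the 60 tied to height never appears in it.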


    Then process the matches with re.findall and ([<>])\s+(\d+), which uses 2 capture groups: the sign in group 1 and the digits in group 2.

    import re
    
    pattern = r"\bage(?:(?:(?!\b(?:height|weight)\b)[^<>])*[<>]\s+\d+)+"
    s = ("age > 10 but can be > 20 or > 22 - if the height is > 60 then age can be > 30, otherwise it must be < 35\n")
    
    for m in re.findall(pattern, s):
        print(re.findall(r"([<>])\s+(\d+)", m))
    

    Output

    [('>', '10'), ('>', '20'), ('>', '22')]
    [('>', '30'), ('<', '35')]
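
The loop above prints one list per matched span, while the question asks for a single age_list. A minimal sketch of one way to merge them, assuming the same two patterns; the helper name extract_age_bounds is hypothetical and not part of the original answer.

```python
import re

# The two patterns from the answer, compiled once.
AGE_SPAN = re.compile(r"\bage(?:(?:(?!\b(?:height|weight)\b)[^<>])*[<>]\s+\d+)+")
SIGN_DIGITS = re.compile(r"([<>])\s+(\d+)")


def extract_age_bounds(text):
    """Return every (sign, digits) pair tied to age as one flat list."""
    age_list = []
    for span in AGE_SPAN.findall(text):
        # Each span starts at age and stops before height or weight.
        age_list.extend(SIGN_DIGITS.findall(span))
    return age_list


sample_text = (
    "age > 10 but can be > 20 or > 22 - if the height is > 60 "
    "then age can be > 30, otherwise it must be < 35"
)
age_list = extract_age_bounds(sample_text)
print(age_list)
# [('>', '10'), ('>', '20'), ('>', '22'), ('>', '30'), ('<', '35')]
```

The same helper handles the second sentence from the question, "the age must be > 20 or if this then > 24 or also > 30 if that", returning the three pairs that follow age.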