python, regex, regex-lookarounds

Python | Regex | get numbers from the text


I have text of the form

Refer to Annex 1.1, 1.2 and 2.0 containing information etc,

or

Refer to Annex 1.0.1, 1.1.1 containing information etc,

I need to extract the numbers that the Annex refers to. I tried a lookbehind regex, as below.

m = re.search(r"(?<=Annex)\s*[\d+.\d+,]+", text)

print(m)
>>> <re.Match object; span=(11, 15), match=' 1.1'>

The output is just 1.1; the remaining numbers are missing. How do I get all the numbers that follow the keyword Annex?


Solution

  • You can use the following two-step solution:

    import re

    texts = ['Refer to Annex 1.1, 1.2 and 2.0 containing information etc,', 'Refer to Annex 1.0.1, 1.1.1 containing information etc,']
    # Step 1: capture the run of numbers and separators that follows "Annex"
    rx = re.compile(r'Annex\s*(\d+(?:(?:\W|and)+\d)*)')
    for text in texts:
        match = rx.search(text)
        if match:
            # Step 2: pull each dot-separated number out of the captured run
            print(re.findall(r'\d+(?:\.\d+)*', match.group(1)))
    

    The output is:

    ['1.1', '1.2', '2.0']
    ['1.0.1', '1.1.1']
    

    The Annex\s*(\d+(?:(?:\W|and)+\d)*) regex matches:

    • Annex - the literal string Annex
    • \s* - zero or more whitespace characters
    • (\d+(?:(?:\W|and)+\d)*) - Group 1: one or more digits, then zero or more repetitions of one or more non-word characters or "and" strings followed by a single digit.
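    To see what Group 1 holds before any numbers are split out, here is a minimal sketch that runs only the first pattern on the first sample sentence. The variable name captured is just for illustration.

```python
import re

# Same first-step pattern as in the answer above.
rx = re.compile(r'Annex\s*(\d+(?:(?:\W|and)+\d)*)')

text = 'Refer to Annex 1.1, 1.2 and 2.0 containing information etc,'
match = rx.search(text)

# Group 1 is the whole run of numbers plus the separators between them.
captured = match.group(1) if match else None
print(repr(captured))  # '1.1, 1.2 and 2.0'
```

    The capture stops before " containing" because the separators there are not followed by a digit.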

    Once a match is found, every dot-separated digit sequence in Group 1 is extracted with \d+(?:\.\d+)*.
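    The two steps can also be wrapped in one function. The sketch below is not part of the original answer: the name extract_annex_numbers is hypothetical, and the first pattern is changed to end in \d+ instead of \d. With a single \d, a component such as 1.10 ends the capture after its first digit, so this variant may be preferable if your annex numbers can have multi-digit parts.

```python
import re

# Assumption: \d+ after the separator (the answer uses \d) so that
# multi-digit components such as "1.10" or "12.3.4" are captured whole.
ANNEX_RX = re.compile(r'Annex\s*(\d+(?:(?:\W|and)+\d+)*)')
NUMBER_RX = re.compile(r'\d+(?:\.\d+)*')


def extract_annex_numbers(text):
    """Return the dot-separated numbers listed after the first 'Annex'."""
    match = ANNEX_RX.search(text)
    if not match:
        return []  # no "Annex <number>" reference in the text
    return NUMBER_RX.findall(match.group(1))


print(extract_annex_numbers('Refer to Annex 1.10, 2.0 and 12.3.4 containing information etc,'))
# ['1.10', '2.0', '12.3.4']
print(extract_annex_numbers('No annex reference here'))
# []
```

    Returning an empty list when nothing matches keeps the caller free of None checks. Note that only the first Annex reference in the text is handled.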