Ignoring irrelevant section in regex


I have the following text (this is closely related to an earlier question, but not the same):

text = '7\n\x0c\n7.\tA B C\n\n7.1\tbla bla bla .\n\n7.2\tanother bla bla \n\n7.3\tand another one.\n\n8.\tX Y Z\n\n8.1\tha ha ha \n\n(a)\thohoho ;\n\n(b)\thihihi,\n\n8'

I want to select only section 7, so that I get:

7.  A B C

7.1 bla bla bla .

7.2 another bla bla 

7.3 and another one.

So I do:

print(re.findall(r'^\d+\.\s*A B C[^\S\n]*(?:\n\n.+)*', text, re.M)[0])

which gives:

7.  A B C

7.1 bla bla bla .

7.2 another bla bla 

7.3 and another one.

8.  X Y Z

8.1 ha ha ha 

(a) hohoho ;

(b) hihihi,

8

As you can see, the match continues past section 7 all the way to the lone 8 that comes after 8.1. This seems to confuse the regex. What can I do in this case?

Note that the section numbers can differ in general, so I cannot do something like re.findall(r'^7\..*', text, re.MULTILINE) (namely, A B C can be placed in other sections).


Solution

  • You can use

    ^(\d+\.)\s*A B C(?:\s*\n\1\b.*)*
    
    • ^ Start of a line (the pattern is used with re.MULTILINE)
    • (\d+\.)\s*A B C Capture group 1 to match 1+ digits and a dot, then match optional whitespace chars and A B C
    • (?: Non capture group to match as a whole
      • \s*\n Match optional whitespace chars and a newline
      • \1\b A backreference to group 1 followed by a word boundary, so a word char must follow the dot (7.1 qualifies, a bare 7. or 8. does not)
      • .* Match the rest of the line
    • )* Close the non capture group and optionally repeat it
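
To make the effect of the backreference plus word boundary more concrete, here is a minimal sketch that tests single lines against a fixed prefix. The names prefix, line_re, candidates and accepted are made up for this illustration, and the literal 7. only stands in for whatever group 1 would capture in the real pattern.

```python
import re

# Stand-in for the text captured by group 1 (an assumption for this demo).
prefix = "7."

# Per-line equivalent of \1\b.* : the prefix, a word boundary, then the rest of the line.
line_re = re.compile(re.escape(prefix) + r"\b.*")

candidates = [
    "7.1\tbla bla bla .",
    "7.3\tand another one.",
    "8.\tX Y Z",
    "7.\tA B C",
    "(a)\thohoho ;",
]

# A line is accepted only if it starts with the prefix and a word character follows the dot.
accepted = [line for line in candidates if line_re.match(line)]
print(accepted)
```

Only the 7.1 and 7.3 lines are accepted. The 7. heading line is rejected because a tab follows the dot, so there is no word boundary at that position.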

    See a regex demo.

    import re
    
    text = '7\n\x0c\n7.\tA B C\n\n7.1\tbla bla bla .\n\n7.2\tanother bla bla \n\n7.3\tand another one.\n\n8.\tX Y Z\n\n8.1\tha ha ha \n\n(a)\thohoho ;\n\n(b)\thihihi,\n\n8'
    pattern = r"^(\d+\.)\s*A B C(?:\s*\n\1\b.*)*"
    
    m = re.search(pattern, text, re.MULTILINE)
    if m:
        print(m.group())
    

    Output

    7.      A B C
    
    7.1     bla bla bla .
    
    7.2     another bla bla 
    
    7.3     and another one.
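
Since the question mentions that the heading can appear under a different section number, the same pattern could be wrapped in a small helper. This is only a sketch: the function name extract_section and its title parameter are hypothetical, and it assumes the heading text is known and every sub-item line starts with the same number prefix.

```python
import re

def extract_section(text, title):
    """Return the numbered section headed by `title`, or None if it is absent."""
    # re.escape keeps any regex metacharacters in the title literal.
    pattern = r"^(\d+\.)\s*" + re.escape(title) + r"(?:\s*\n\1\b.*)*"
    m = re.search(pattern, text, re.MULTILINE)
    return m.group() if m else None

text = '7\n\x0c\n7.\tA B C\n\n7.1\tbla bla bla .\n\n7.2\tanother bla bla \n\n7.3\tand another one.\n\n8.\tX Y Z\n\n8.1\tha ha ha \n\n(a)\thohoho ;\n\n(b)\thihihi,\n\n8'

section_7 = extract_section(text, "A B C")
section_8 = extract_section(text, "X Y Z")
print(section_7)
print(section_8)
```

For "X Y Z" the helper returns only the heading and the 8.1 line. The (a) and (b) items do not start with 8., so the backreference stops the match before them.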