Ignoring irrelevant section in regex

I have the following text (this is closely related to this, but not identical):

text = '7\n\x0c\n7.\tA B C\n\n7.1\tbla bla bla .\n\n7.2\tanother bla bla \n\n7.3\tand another one.\n\n8.\tX Y Z\n\n8.1\tha ha ha \n\n(a)\thohoho ;\n\n(b)\thihihi,\n\n8'

I want to select only section 7, so that I get:

7.  A B C

7.1 bla bla bla .

7.2 another bla bla 

7.3 and another one.

So I do:

print(re.findall(r'^\d+\.\s*A B C[^\S\n]*(?:\n\n.+)*', text, re.M)[0])

which gives:

7.  A B C

7.1 bla bla bla .

7.2 another bla bla 

7.3 and another one.

8.  X Y Z

8.1 ha ha ha 

(a) hohoho ;

(b) hihihi,

8

As you can see, 8 comes after 8.1, which seems to confuse the regex. What can I do in this case?

Note that the section numbers can differ in general, so I cannot do something like re.findall(r'^7\..*', text, re.MULTILINE) (namely, A B C can appear in other sections).

Solution

You can use

^(\d+\.)\s*A B C(?:\s*\n\1\b.*)*

^ Start of a line (because re.MULTILINE is used)
(\d+\.)\s*A B C Capture group 1 matches 1+ digits and a ., then match optional whitespace and A B C
(?: Non-capturing group, repeated as a whole
- \s*\n Match optional whitespace chars and a newline
- \1\b A backreference to group 1 (the same section number and dot), followed by a word boundary
- .* Match the rest of the line
)* Close the non-capturing group and repeat it zero or more times
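To illustrate why the backreference stops the match at section 8, here is a small sketch. It assumes group 1 has already captured `7.` and checks which of a few sample lines from the text would be accepted as a continuation; the names `continuation`, `samples` and `results` are only for illustration.

```python
import re

# Assume group 1 captured "7." from the heading line; each following line
# must then start with "7." and a word boundary to be part of the match.
continuation = re.compile(r"7\.\b.*")

samples = ["7.1\tbla bla bla .", "8.\tX Y Z", "8.1\tha ha ha ", "(a)\thohoho ;"]
results = {line: bool(continuation.match(line)) for line in samples}

for line, ok in results.items():
    print(f"{line!r}: {ok}")
```

Only the `7.1` line is accepted. The lines starting with `8.` or `(a)` do not begin with the captured `7.`, so the repetition ends there.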

See a regex demo.

import re

text = '7\n\x0c\n7.\tA B C\n\n7.1\tbla bla bla .\n\n7.2\tanother bla bla \n\n7.3\tand another one.\n\n8.\tX Y Z\n\n8.1\tha ha ha \n\n(a)\thohoho ;\n\n(b)\thihihi,\n\n8'
pattern = r"^(\d+\.)\s*A B C(?:\s*\n\1\b.*)*"

m = re.search(pattern, text, re.MULTILINE)
if m:
    print(m.group())

Output

7.      A B C

7.1     bla bla bla .

7.2     another bla bla 

7.3     and another one.
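
If the heading text varies, the same pattern can be wrapped in a small helper. This is only a sketch: `extract_section` is a hypothetical name, and it assumes the heading always has the form "number, dot, whitespace, title".

```python
import re

text = '7\n\x0c\n7.\tA B C\n\n7.1\tbla bla bla .\n\n7.2\tanother bla bla \n\n7.3\tand another one.\n\n8.\tX Y Z\n\n8.1\tha ha ha \n\n(a)\thohoho ;\n\n(b)\thihihi,\n\n8'

def extract_section(text, title):
    """Return the section whose heading is '<number>. <title>', or None."""
    # re.escape guards against regex metacharacters in the title.
    pattern = rf"^(\d+\.)\s*{re.escape(title)}(?:\s*\n\1\b.*)*"
    m = re.search(pattern, text, re.MULTILINE)
    return m.group() if m else None

section = extract_section(text, "A B C")
if section is not None:
    print(section)
```

Note that `\1\b` requires a word character right after the section number and dot, so lines such as `(a)` or `(b)` under section 8 would not be included when calling `extract_section(text, "X Y Z")`.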