
Using Regex to select specific section of a text


Suppose I have the following document:

document1 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. ABC \n2.1 hello ABC\n2.2 bla bla bla\n\n3. XYZ\n3.1 bla bla\n3.2 more bla bla\n3.3 even more bla bla'

which has the following format:

1. Hello world
1.1 bla bla bla
1.2 more bla bla
1.3 even more bla bla ABC

2. ABC 
2.1 hello ABC
2.2 bla bla bla

3. XYZ
3.1 bla bla
3.2 more bla bla
3.3 even more bla bla

How can I select only the ABC section, so that the output is:

2. ABC 
2.1 hello ABC
2.2 bla bla bla

One might suggest re.findall(r'^2\..*', document1, re.MULTILINE), but note that the ABC section is not always at point 2. For instance, I can have:

document2 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. XYZ\n2.1 bla bla\n2.2 more bla bla\n2.3 even more bla bla\n\n\n3. MNO\n 3.1 hello MNO\n3.2 bla bla bla\n\n\n4. ABC\n4.1 hello ABC\n4.2 bla bla bla'

1. Hello world
1.1 bla bla bla
1.2 more bla bla
1.3 even more bla bla ABC

2. XYZ
2.1 bla bla
2.2 more bla bla
2.3 even more bla bla

3. MNO 
3.1 hello MNO
3.2 bla bla bla

4. ABC 
4.1 hello ABC
4.2 bla bla bla

where ABC is in section 4.
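
To make the problem concrete, here is a small check of the hard-coded pattern against both documents. The variable name hardcoded is only for illustration.

```python
import re

document1 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. ABC \n2.1 hello ABC\n2.2 bla bla bla\n\n3. XYZ\n3.1 bla bla\n3.2 more bla bla\n3.3 even more bla bla'
document2 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. XYZ\n2.1 bla bla\n2.2 more bla bla\n2.3 even more bla bla\n\n\n3. MNO\n 3.1 hello MNO\n3.2 bla bla bla\n\n\n4. ABC\n4.1 hello ABC\n4.2 bla bla bla'

# Hard-coding the section number only works while ABC stays at point 2.
hardcoded = r'^2\..*'

result1 = re.findall(hardcoded, document1, re.MULTILINE)
result2 = re.findall(hardcoded, document2, re.MULTILINE)

print(result1)  # ['2. ABC ', '2.1 hello ABC', '2.2 bla bla bla']
print(result2)  # ['2. XYZ', '2.1 bla bla', '2.2 more bla bla', '2.3 even more bla bla']
```

In document2 the same pattern returns the XYZ section, which is not what I want.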


Solution

  • You can use

    ^\d+\.\s*ABC[^\S\n]*(?:\n.+)*
    

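    Pass only the re.M (re.MULTILINE) flag so that ^ matches at the start of every line. Do not add re.S (re.DOTALL): the dot must not match line breaks, otherwise the match would run past the blank line that ends the section. Details: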

    • ^ - start of a line
    • \d+ - one or more digits
    • \. - a dot
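    • \s* - zero or more whitespace characters
    • ABC - the literal string ABC
    • [^\S\n]* - zero or more whitespace characters other than a line feed (LF)
    • (?:\n.+)* - zero or more following non-empty lines (a line feed, then one or more characters other than line breaks).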
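
The breakdown above can also be written directly into the pattern. The following is a minimal sketch that combines re.M with re.X (re.VERBOSE) so that each part carries its own comment; the name SECTION_ABC is chosen here for illustration only.

```python
import re

document1 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. ABC \n2.1 hello ABC\n2.2 bla bla bla\n\n3. XYZ\n3.1 bla bla\n3.2 more bla bla\n3.3 even more bla bla'

# Same pattern as above; re.X ignores the layout whitespace and the comments.
SECTION_ABC = re.compile(
    r"""
    ^\d+\.        # section number at the start of a line, e.g. "2."
    \s*ABC        # optional whitespace, then the heading text
    [^\S\n]*      # trailing spaces or tabs on the heading line, but no LF
    (?:\n.+)*     # every following line that is not empty
    """,
    re.M | re.X,
)

match = SECTION_ABC.search(document1)
section = match.group() if match else None
print(section)
```

The blank line after "2.2 bla bla bla" stops the match, because .+ needs at least one character after the line feed.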

    To get all matches, you can use

    import re
    matches = re.findall(r'^\d+\.\s*ABC[^\S\n]*(?:\n.+)*', document1, re.M)
    

    To get the first match only you can use

    match = re.search(r'^\d+\.\s*ABC[^\S\n]*(?:\n.+)*', document1, re.M)
    if match:
        print(match.group())
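
If the heading text varies, the same pattern can be wrapped in a small helper. This is only a sketch: extract_section is a hypothetical name, and it assumes the heading should be matched literally, which is why the title goes through re.escape.

```python
import re
from typing import Optional

document2 = '1. Hello world\n1.1 bla bla bla\n1.2 more bla bla\n1.3 even more bla bla ABC\n\n2. XYZ\n2.1 bla bla\n2.2 more bla bla\n2.3 even more bla bla\n\n\n3. MNO\n 3.1 hello MNO\n3.2 bla bla bla\n\n\n4. ABC\n4.1 hello ABC\n4.2 bla bla bla'


def extract_section(text: str, title: str) -> Optional[str]:
    """Return the first match for a numbered heading beginning with `title`, or None."""
    # re.escape keeps characters such as "." or "+" in the title literal.
    pattern = rf'^\d+\.\s*{re.escape(title)}[^\S\n]*(?:\n.+)*'
    match = re.search(pattern, text, re.M)
    return match.group() if match else None


abc = extract_section(document2, 'ABC')
mno = extract_section(document2, 'MNO')
missing = extract_section(document2, 'QRS')

print(abc)      # the "4. ABC" section
print(mno)      # the "3. MNO" section, including the indented " 3.1" line
print(missing)  # None
```

Note that the title is matched as a prefix of the heading, so 'ABC' would also hit a heading such as "5. ABCD" and return only the "5. ABC" part. If the heading must equal the title exactly, one option is to add $ right after [^\S\n]* in the pattern.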