Tags: python, regex, regex-lookarounds

Python regex: match a full paragraph, including newlines


I have a text file from which I want to match full paragraph blocks, but my current regex only matches the first line of each paragraph instead of the whole paragraph, newlines included.

Example text:

NOMEAR JOSIAS CARLOS BORRHER do cargo em comissão
OTHER TEXT GOES HERE
....................
020007/002832/2020.

EXONERAR DOUGLAS ALVES BORRHER do cargo em comissão
OTHER TEXT GOES HERE
....................
020007/002832/2020.

NOMEAR RAFAEL DOS SANTOS PASSAGEM para exercer o cargo
OTHER TEXT GOES HERE
....................
020007/002832/2020.

From the text above, I want to match every full paragraph that starts with the word NOMEAR:

NOMEAR JOSIAS CARLOS BORRHER do cargo em comissão
OTHER TEXT GOES HERE
....................
020007/002832/2020.


NOMEAR RAFAEL DOS SANTOS PASSAGEM para exercer o cargo
OTHER TEXT GOES HERE
....................
020007/002832/2020.

What I have tried:

import re
pattern = re.compile("NOMEAR (.*)", re.DOTALL)

for i, line in enumerate(open('pdf_text_tika.txt')):
    for match in re.finditer(pattern, line):
        print ('Found on line %s: %s' % (i+1, match.group()))

Output:

Found on line 1305: NOMEAR JOSIAS CARLOS BORRHER do cargo em comissão

Found on line 1316: NOMEAR RAFAEL DOS SANTOS PASSAGEM para exercer o cargo


Solution

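  • Your loop hands the regex one line at a time, so the pattern never sees more than a single line and re.DOTALL has nothing to span. Read the whole file into one string instead, and use this simpler regex in MULTILINE mode: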

    ^NOMEAR.+(?:\n.+)*
    

    In Python:

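    import re
    
    # ^ anchors at the start of every line because of re.MULTILINE
    pattern = re.compile(r'^NOMEAR.+(?:\n.+)*', re.MULTILINE)
    
    # Read the whole file at once so a match can span several lines
    with open('pdf_text_tika.txt', 'r') as file:
        data = file.read()
    
    print(pattern.findall(data))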
    

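To see why the pattern stops at the end of each paragraph, here is a minimal sketch that runs it on the sample text from the question instead of the file. The names SAMPLE, PATTERN and find_paragraphs are made up for this illustration. It assumes paragraphs are separated by truly empty lines: `.+` needs at least one character, so `(?:\n.+)*` cannot step over an empty line, whereas a line containing only spaces would still be treated as part of the paragraph. It also reports the starting line number of each match, similar to the output format in the question.

```python
import re

# Sample text copied from the question; paragraphs are separated by empty lines.
SAMPLE = """\
NOMEAR JOSIAS CARLOS BORRHER do cargo em comissão
OTHER TEXT GOES HERE
....................
020007/002832/2020.

EXONERAR DOUGLAS ALVES BORRHER do cargo em comissão
OTHER TEXT GOES HERE
....................
020007/002832/2020.

NOMEAR RAFAEL DOS SANTOS PASSAGEM para exercer o cargo
OTHER TEXT GOES HERE
....................
020007/002832/2020.
"""

# ^NOMEAR.+  : a line starting with NOMEAR (^ matches at each line start in MULTILINE mode)
# (?:\n.+)*  : every following line that contains at least one character
PATTERN = re.compile(r'^NOMEAR.+(?:\n.+)*', re.MULTILINE)


def find_paragraphs(text):
    """Return (line_number, paragraph) pairs; hypothetical helper for this sketch."""
    results = []
    for match in PATTERN.finditer(text):
        # Counting the newlines before the match gives a 1-based line number.
        line_number = text.count('\n', 0, match.start()) + 1
        results.append((line_number, match.group()))
    return results


for line_number, paragraph in find_paragraphs(SAMPLE):
    print('Found on line %d:\n%s\n' % (line_number, paragraph))
```

The EXONERAR paragraph is skipped because `^NOMEAR` only matches at the start of a line that begins with that word. With the real file, pass the string returned by file.read() to find_paragraphs in place of SAMPLE.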