python regex text nlp

Python Regex: How Can I Find Recurring Blocks of Text in a Text File?


I'm trying to parse repeating blocks of text in Python. Each block begins with '----BEGIN---' and ends with '---END'; a sample of the file is shown after the code. I want to extract each block (words, numbers, and special characters) for further analysis. The code below is the closest I have gotten, but it returns the entire document as one match instead of each block separately. Any help would be appreciated.

import re

block_search = re.compile('----BEGIN---.*---END', re.DOTALL)
with open(file, 'r', encoding='utf-8') as f:
    text = f.read()
result = block_search.findall(text)

----BEGIN--- Words Special Character Numbers words Special character words numbers words words. words numbers words Special character words numbers words words words numbers words words ---END

----BEGIN--- Words words numbers words Special character words numbers words words. words numbers words Special character words numbers words words words numbers words words ... ---END


Solution

  • .* is greedy, so '----BEGIN---.*---END' matches everything from the first occurrence of ----BEGIN--- to the last occurrence of ---END. With re.DOTALL the dot also matches newlines, which is why the whole document comes back as a single match. To match each block separately, use the non-greedy .*? instead. It consumes as few characters as possible, so each match stops at the first ---END that follows its ----BEGIN---.

    block_search = re.compile('----BEGIN---.*?---END',re.DOTALL)
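To see the difference between the two patterns side by side, here is a minimal sketch. The sample text is made up to mirror the layout in the question, and the capture group around .*? is an optional addition (not part of the answer above) that returns only the text between the delimiters.

```python
import re

# Hypothetical sample text with two blocks, mirroring the layout in the question.
text = (
    "----BEGIN--- first block, line 1\n"
    "line 2 ---END\n"
    "\n"
    "----BEGIN--- second block ---END\n"
)

# Greedy: .* runs to the LAST ---END, so both blocks come back as one match.
greedy = re.compile(r"----BEGIN---.*---END", re.DOTALL)

# Non-greedy: .*? stops at the FIRST ---END after each ----BEGIN---.
# The parentheses (an addition for illustration) capture only the block body.
lazy = re.compile(r"----BEGIN---(.*?)---END", re.DOTALL)

greedy_matches = greedy.findall(text)
blocks = [body.strip() for body in lazy.findall(text)]

print(len(greedy_matches))  # 1
print(blocks)               # ['first block, line 1\nline 2', 'second block']
```

For a large file, lazy.finditer(text) yields match objects one at a time instead of building the whole list up front.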