I'm trying to parse repeating blocks of text that all begin with '----BEGIN---' and end with '---END', using Python. The text file looks like the sample shown after the code. I want to find each block (words, numbers, and special characters) and parse it for further analysis. The code below is as close as I have gotten, but it returns the entire document, not each block. Any help would be appreciated.
import re

block_search = re.compile('----BEGIN---.*---END', re.DOTALL)
# file holds the path to the text file
with open(file, 'r', encoding='utf-8') as f:
    text = f.read()
result = re.findall(block_search, text)
----BEGIN--- Words Special Character Numbers words Special character words numbers words words. words numbers words Special character words numbers words words words numbers words words ---END
----BEGIN--- Words words numbers words Special character words numbers words words. words numbers words Special character words numbers words words words numbers words words ... ---END
'----BEGIN---.*---END' matches everything from the first occurrence of ----BEGIN--- to the last occurrence of ---END, because .* is greedy: it consumes as much text as it can.
To match each block separately, use .*? instead. It is non-greedy, so it matches as little as possible and stops at the first ---END that follows each ----BEGIN---:
block_search = re.compile('----BEGIN---.*?---END',re.DOTALL)
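To see the difference side by side, here is a minimal sketch that runs both patterns on a small in-memory string. The sample text and the names greedy, lazy, and blocks are made up for illustration and are not from the original file. The capture group around .*? is an optional addition so that findall returns only the text between the markers.

```python
import re

# Hypothetical sample text standing in for the contents of the file.
text = (
    "----BEGIN--- first block, 123 !@# ---END\n"
    "----BEGIN--- second block\nspanning two lines ---END\n"
)

# Greedy: .* runs to the LAST ---END, so everything is one match.
greedy = re.compile(r"----BEGIN---.*---END", re.DOTALL)

# Non-greedy: .*? stops at the FIRST ---END after each ----BEGIN---.
# The parentheses make findall return only the text between the markers.
lazy = re.compile(r"----BEGIN---(.*?)---END", re.DOTALL)

greedy_matches = greedy.findall(text)
blocks = [body.strip() for body in lazy.findall(text)]

print(len(greedy_matches))  # 1
print(len(blocks))          # 2
for block in blocks:
    print(repr(block))
```

If the markers should stay in the result, drop the parentheses and findall will return the full matched text of each block instead.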