Tags: regex, python-3.x, text-manipulation

How to extract text between two substrings from a file in Python


I want to read the text between two markers (“#*” and “#@”) from a file. My file contains thousands of records in the format shown below. I have tried the code below, but it does not return the required output.

import re
start = '#*'
end = '#@'
myfile = open('lorem.txt')
for line in myfile:
    line = line.rstrip()
    print (line[line.find(start)+len(start):line.rfind(end)])
myfile.close()

My Input:

#*OQL[C++]: Extending C++ with an Object Query Capability

#@José A. Blakeley

#t1995

#cModern Database Systems

#index0

#*Transaction Management in Multidatabase Systems

#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz

#t1995

#cModern Database Systems

#index1

My Output:

51103
OQL[C++]: Extending C++ with an Object Query Capability

t199
cModern Database System
index
...

Expected output:

OQL[C++]: Extending C++ with an Object Query Capability
Transaction Management in Multidatabase Systems

Solution

  • You are reading the file line by line, but each match spans more than one line: the start marker (#*) is on one line and the end marker (#@) is on a later one. Read the whole file into a string and process it with a regex whose . can also match line breaks:

    import re
    start = '#*'
    end = '#@'
    rx = r'{}(.*?){}'.format(re.escape(start), re.escape(end)) # Escape special chars; the group captures only the text between the markers
    with open('lorem.txt') as myfile:
        contents = myfile.read()                     # Read the whole file into a variable
    for match in re.findall(rx, contents, re.S):     # Note re.S will make . match line breaks, too
        print(match.strip())                         # Process each match individually (here: print it without surrounding whitespace)
    

    See the regex demo.
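
To check the approach against the sample records from the question, here is a minimal sketch that runs the same kind of pattern on an in-memory string instead of lorem.txt. The helper name `extract_between` and the `SAMPLE` constant are hypothetical, introduced only for illustration, and the sketch assumes that only the text between the markers is wanted (hence the capturing group and the `strip()` call).

```python
import re

# Sample records copied from the question, held in a string instead of a file.
SAMPLE = """#*OQL[C++]: Extending C++ with an Object Query Capability
#@José A. Blakeley
#t1995
#cModern Database Systems
#index0
#*Transaction Management in Multidatabase Systems
#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz
#t1995
#cModern Database Systems
#index1
"""


def extract_between(text, start, end):
    """Return the stripped text found between each start/end marker pair.

    Hypothetical helper, not part of the original answer.
    """
    # re.escape keeps the '*' in '#*' from being read as a quantifier.
    # (.*?) is a lazy capturing group, so findall returns only the inner text
    # and stops at the first end marker after each start marker.
    pattern = re.compile(
        "{}(.*?){}".format(re.escape(start), re.escape(end)),
        re.S,  # let '.' match line breaks as well
    )
    return [inner.strip() for inner in pattern.findall(text)]


titles = extract_between(SAMPLE, "#*", "#@")
for title in titles:
    print(title)
```

On this sample it prints the two titles listed under "Expected output". Reading the whole file with `read()` should be fine for a few thousand records; if every title is known to sit on a single line starting with #*, a plain `line.startswith('#*')` check while iterating over the file would likely work as well, without a regex.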