I want to read the text between two markers ("#*" and "#@") from a file. My file contains thousands of records in the format shown below. I have tried the code below, but it does not return the required output.
import re
start = '#*'
end = '#@'
myfile = open('lorem.txt')
for line in myfile:
    line = line.rstrip()
    print(line[line.find(start)+len(start):line.rfind(end)])
myfile.close()
My Input:
#*OQL[C++]: Extending C++ with an Object Query Capability
#@José A. Blakeley
#t1995
#cModern Database Systems
#index0
#*Transaction Management in Multidatabase Systems
#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz
#t1995
#cModern Database Systems
#index1
My Output:
51103
OQL[C++]: Extending C++ with an Object Query Capability
t199
cModern Database System
index
...
Expected output:
OQL[C++]: Extending C++ with an Object Query Capability
Transaction Management in Multidatabase Systems
You are reading the file line by line, but each match spans a line break: "#*" starts one line and "#@" starts the next, so no single line contains both markers. Read the whole file into a string and process it with a regex whose dot can also match line breaks:
import re
start = '#*'
end = '#@'
# Escape special chars and build the pattern dynamically; the group captures only the text between the markers
rx = r'{}(.*?){}'.format(re.escape(start), re.escape(end))
with open('lorem.txt') as myfile:
    contents = myfile.read()  # Read the whole file into a variable
for match in re.findall(rx, contents, re.S):  # Note re.S will make . match line breaks, too
    print(match.strip())  # Process each match individually; strip() drops the trailing line break
See the regex demo.
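To check the approach without touching a file, here is a minimal sketch that runs the same kind of pattern against the sample input from the question, held in a string. The helper name extract_between and the variable sample are made up for this illustration. It assumes the text between the markers should be returned without the markers and without the trailing line break.

```python
import re


def extract_between(text, start, end):
    """Return the stripped text found between each start/end marker pair.

    Hypothetical helper for illustration; not part of the original answer.
    """
    pattern = re.compile(
        # re.escape() neutralises regex metacharacters such as '*'.
        # (.*?) lazily captures only the text between the two markers.
        r'{}(.*?){}'.format(re.escape(start), re.escape(end)),
        re.S,  # let '.' match line breaks as well
    )
    return [captured.strip() for captured in pattern.findall(text)]


# Sample records copied from the question.
sample = (
    "#*OQL[C++]: Extending C++ with an Object Query Capability\n"
    "#@José A. Blakeley\n"
    "#t1995\n"
    "#cModern Database Systems\n"
    "#index0\n"
    "#*Transaction Management in Multidatabase Systems\n"
    "#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz\n"
    "#t1995\n"
    "#cModern Database Systems\n"
    "#index1\n"
)

titles = extract_between(sample, '#*', '#@')
for title in titles:
    print(title)
```

With the sample above this prints the two titles listed under "Expected output". Because the whole input is held in memory, this suits files of moderate size. If every title really does sit on its own line starting with "#*", as in the sample, iterating over the lines and keeping those that start with "#*" would be a possible alternative for very large files.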