I'm using Python 3.6 on a Windows machine. I'm reading in a text file using with open() and readlines(). After reading in the lines, I want to write most of them to a new text file but exclude certain ranges of lines. I don't know the line numbers of the lines to exclude. The text files are massive, and the ranges to exclude vary among the files I'm reading. There are known keywords I can search for to find the start and end of each range to exclude.
I've searched everywhere online but I can't seem to find an elegant solution that works. The following is an example of what I'm trying to achieve.
a
b
BEGIN
c
d
e
END
f
g
h
i
j
BEGIN
k
l
m
n
o
p
q
END
r
s
t
u
v
BEGIN
w
x
y
END
z
In summary, I want to read the above into Python and then write it to a new file, excluding every line from each BEGIN keyword through the matching END keyword (the keyword lines included).
The new file should contain the following:
a
b
f
g
h
i
j
r
s
t
u
v
z
If the text files are massive, as you say, you'll want to avoid using readlines(), as that will load the entire file into memory. Instead, read line by line and use a state variable to track whether you're inside a block where output should be suppressed. Something like this:
import re

begin_re = re.compile("^BEGIN.*$")
end_re = re.compile("^END.*$")
should_write = True

with open("input.txt") as input_fh:
    with open("output.txt", "w", encoding="UTF-8") as output_fh:
        for line in input_fh:
            # Strip off whitespace: we'll add our own newline
            # in the print call
            line = line.strip()
            if begin_re.match(line):
                should_write = False
            if should_write:
                print(line, file=output_fh)
            if end_re.match(line):
                should_write = True
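In case it's useful, here is a rough sketch of the same state-variable idea wrapped in a generator, so it can be tested on in-memory data before pointing it at a real file. The name skip_blocks and its begin/end parameters are my own, made up for illustration, and it assumes the keywords appear at the start of a line once surrounding whitespace is ignored.

```python
import io


def skip_blocks(lines, begin="BEGIN", end="END"):
    """Yield only the lines that fall outside begin..end blocks.

    Hypothetical helper: the marker lines themselves are dropped too.
    """
    inside = False  # state variable: are we inside an excluded block?
    for line in lines:
        stripped = line.strip()
        if not inside and stripped.startswith(begin):
            inside = True    # block opens; drop the marker line
        elif inside and stripped.startswith(end):
            inside = False   # block closes; drop the marker line
        elif not inside:
            yield line       # ordinary line outside any block


# Small in-memory stand-in for a real file, read lazily line by line
sample = io.StringIO("a\nb\nBEGIN\nc\nd\nEND\nf\nBEGIN\ng\nEND\nz\n")
kept = [line.rstrip("\n") for line in skip_blocks(sample)]
print(kept)
```

Because an open file is itself an iterator over its lines, the same function should work on the real thing without loading it all into memory, for example by passing the input file handle to skip_blocks and handing the result to the output handle's writelines().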