As a beginner, I'm trying to solve the following problem (bash or python script):
The file (~50 GB!):
marker
xxx
xxx
xxx
pattern
marker
xxx
xxx
xxx
marker
xxx
xxx
xxx
pattern
I would like to find a way to remove the lines between two markers, including the first marker but not the second one, IF no pattern can be found anywhere in those lines.
Wanted result:
marker
xxx
xxx
xxx
pattern
[empty!]
marker
xxx
xxx
xxx
pattern
I tried to solve it with regex or awk (that's a very shy beginning):
awk '/marker/{f=1} f; /marker/{f=1}' file
but I'm having a hard time understanding how to turn that into a function that would solve the entire problem. It would make me very happy if someone could help me with that!
Cheers
Here's a way to do it in Python. Treat marker as a separator, then remove any of the text snippets between separators that don't contain pattern:
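# Read the whole file into memory and split it into blocks at each marker line.
with open('markerfile.txt', 'r') as f:
    lines = f.read().split('marker\n')
# Keep blocks containing the pattern; an empty entry means a marker opened the file.
lines = [entry for entry in lines if 'pattern' in entry or not entry]
print('marker\n'.join(lines), end='')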
Edit: the "or not entry" bit in the list comprehension just handles the case where marker is the first line in the file.
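To see where that empty entry comes from, here is a small sketch on an in-memory sample. The sample text and the function name filter_blocks are made up for illustration; the filtering itself is the same list comprehension as above.

```python
def filter_blocks(text, marker='marker\n', pattern='pattern'):
    """Drop every marker-delimited block that does not contain the pattern."""
    entries = text.split(marker)
    # When the text starts with the marker, entries[0] is the empty string.
    # "or not entry" keeps it so the join below restores the leading marker.
    kept = [entry for entry in entries if pattern in entry or not entry]
    return marker.join(kept)


# Hypothetical sample: three blocks, the middle one has no pattern.
sample = 'marker\nxxx\npattern\nmarker\nxxx\nmarker\nxxx\npattern\n'
first_entry = sample.split('marker\n')[0]
result = filter_blocks(sample)
print(repr(first_entry))
print(result, end='')
```

Without the "or not entry" part, the empty first entry would be filtered out and the output would lose its leading marker line.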
Edit 2: Here's a streaming version (better suited for large files). It uses islice from itertools to get n lines of the file at a time. The rest of the algorithm is more or less the same.
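from itertools import islice

n = 5  # lines read per chunk; use a much larger value for a 50 GB file
with open('markerfile.txt', 'r') as f, open('markersout.txt', 'w') as fout:
    pending = ''  # unfinished block carried over from the previous chunk
    prefix = ''   # becomes 'marker\n' once the first marker has been seen
    while True:
        next_n_lines = ''.join(islice(f, n))
        # Every piece except the last is a complete block (it ended at a marker).
        pieces = (pending + next_n_lines).split('marker\n')
        if not next_n_lines:
            complete, pending = pieces, ''  # end of file: flush what is left
        else:
            complete, pending = pieces[:-1], pieces[-1]
        for entry in complete:
            # Text before the first marker is always kept; blocks need the pattern.
            if 'pattern' in entry or not prefix:
                fout.write(prefix + entry)
            prefix = 'marker\n'
        if not next_n_lines:
            break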
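As an alternative that does not depend on a chunk size at all, here is a minimal line-by-line sketch. It is not part of the answer above: the generator name keep_blocks_with_pattern is hypothetical, and it assumes that a marker is a line consisting of exactly the text marker and that a trailing block without a pattern should also be dropped. Only one block is held in memory at a time.

```python
def keep_blocks_with_pattern(lines, marker='marker', pattern='pattern'):
    """Yield lines, skipping any marker-started block that lacks the pattern."""
    block = []           # lines of the block currently being collected
    in_block = False     # stays False until the first marker line is seen
    has_pattern = False  # whether the current block contains the pattern
    for line in lines:
        if line.rstrip('\n') == marker:
            # A new marker closes the previous block: emit it only if it
            # matched, or if it is the text that preceded the first marker.
            if not in_block or has_pattern:
                yield from block
            block, in_block, has_pattern = [line], True, False
        else:
            block.append(line)
            if pattern in line:
                has_pattern = True
    # Handle whatever is left when the input ends.
    if not in_block or has_pattern:
        yield from block


# Small in-memory demo; a file object can be passed in the same way, e.g.
#   with open('markerfile.txt') as f, open('markersout.txt', 'w') as fout:
#       fout.writelines(keep_blocks_with_pattern(f))
sample = ['marker\n', 'xxx\n', 'pattern\n',
          'marker\n', 'xxx\n',
          'marker\n', 'xxx\n', 'pattern\n']
filtered = list(keep_blocks_with_pattern(sample))
print(''.join(filtered), end='')
```

Because a file object yields one line at a time, memory use depends on the size of the largest block rather than on the size of the file.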