
Python or bash script: if pattern in lines between two identical markers, remove lines and first marker


As a beginner, I'm trying to solve the following problem (bash or python script):

the file (~50G!):

marker
xxx
xxx
xxx
pattern
marker
xxx
xxx
xxx
marker
xxx
xxx
xxx
pattern

I would like to find a way to remove the lines between two markers + the first marker, but not the last occurrence of the marker IF no pattern can be found throughout the lines.

Wanted result:

marker
xxx
xxx
xxx
pattern
[empty!]
marker
xxx
xxx
xxx
pattern

I tried to solve it with regex or awk (that's a very shy beginning)

awk '/marker/{f=1} f; /marker/{f=1}' file

but I'm having a hard time understanding how to implement that in a function that would solve the entire problem. It would make me very happy if someone could help me with that!

Cheers


Solution

  • Here's a way to do it in Python. Treat marker as a separator, then drop every text snippet between markers that doesn't contain pattern.

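    with open('markerfile.txt') as f:
        # Split the whole file into the text snippets between marker lines
        lines = f.read().split('marker\n')
    
    # Keep snippets that contain the pattern (and empty ones, see the edit below)
    lines = [entry for entry in lines if 'pattern' in entry or not entry]
    print('marker\n'.join(lines), end='')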
    

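    Edit: the or not entry part of the list comprehension handles the case where marker is the first line of the file. The split then yields an empty first entry, which has to survive the filter so that the join restores the leading marker.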

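    Edit 2: Here's a streaming version, better suited to large files. It uses islice from itertools to read n lines of the file at a time; the rest of the algorithm is more or less the same. Note that each chunk is filtered on its own, so the result is only right when a block and its pattern never straddle a chunk boundary.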

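    from itertools import islice
    
    n = 5  # lines read per chunk; a block split across two chunks is judged piecewise
    with open('markerfile.txt') as f, open('markersout.txt', 'w') as fout:
        while True:
            # Read the next n lines and join them into one string
            next_n_lines = ''.join(islice(f, n))
            if not next_n_lines:
                break  # end of file
            # Same split/filter/join as above, applied to this chunk only
            lines = next_n_lines.split('marker\n')
            lines = [entry for entry in lines if 'pattern' in entry or not entry]
            print('marker\n'.join(lines).strip(), file=fout)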
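
Since the question asks for a function, here is a minimal sketch of a line-by-line variant that buffers one block at a time instead of reading fixed-size chunks, so a block is always judged as a whole. It is not the method from the answer above. The function name drop_blocks_without_pattern and its parameters are made up for illustration. It assumes the marker occupies a whole line by itself, and that any lines before the first marker should be kept unchanged.

```python
import io


def drop_blocks_without_pattern(src, dst, marker="marker", pattern="pattern"):
    """Copy src to dst, skipping every marker-led block that lacks pattern.

    A block is a marker line plus all lines up to (not including) the next
    marker line. Only the current block is held in memory.
    """
    block = []    # lines of the block currently being collected
    keep = True   # assumption: lines before the first marker are kept
    for line in src:
        if line.rstrip("\r\n") == marker:
            # A new block starts: write out the previous one if it qualified.
            if keep:
                dst.writelines(block)
            block = [line]
            keep = False
        else:
            block.append(line)
            if pattern in line:
                keep = True
    if keep:      # write out the final block
        dst.writelines(block)


# Demonstration on the sample from the question, using in-memory files.
sample = (
    "marker\nxxx\nxxx\nxxx\npattern\n"
    "marker\nxxx\nxxx\nxxx\n"
    "marker\nxxx\nxxx\nxxx\npattern\n"
)
out = io.StringIO()
drop_blocks_without_pattern(io.StringIO(sample), out)
print(out.getvalue(), end="")

# For real files (names taken from the answer above):
# with open("markerfile.txt") as src, open("markersout.txt", "w") as dst:
#     drop_blocks_without_pattern(src, dst)
```

On the sample this prints the first and third blocks and drops the second one, marker included. Memory use is bounded by the largest single block, so a file with very long stretches between markers would still need a different approach, for example writing lines through as soon as the pattern has been seen.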