python, python-3.x, data-analysis

How to slice data from a text file given the desired range of lines


I have a very large text file and want to extract several specific parts of it into a new text file that contains only that data. My approach was to first find the line numbers where each desired part begins and ends, and then use those numbers as the ranges for slicing. The reason is that the file also contains long stretches of descriptions and annotations that I need to get rid of. Should I use itertools.islice?

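DataStartLine = []
DataEndLine = []
with open("KMAP_2018_04_23_071018_fast_00001.txt", "r") as KMAPspec:
    for x, line in enumerate(KMAPspec):
        # Line index where a data block begins
        if line.find("#C imageFile") != -1:
            DataStartLine.append(x)
        # Line index where a data block ends
        if line.find("#S") != -1:
            DataEndLine.append(x)
with open("output.txt", "w") as out:
    pass  # slicing of the ranges found above still has to be written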

Solution

  • When the text file is really big, loading its whole content into a variable is risky because you could run out of memory. In your case it looks like you can read and write in a single pass. If the #C and #S lines should be excluded from the output:

    with open("KMAP_2018_04_23_071018_fast_00001.txt","r") as KMAPspec:
        with open("output.txt","w") as out: 
            should_write = False
            for line in KMAPspec:
                # When I meet this line, stop writing out
                if line.find("#S")!=-1:
                    should_write = False
                # Write out only if between the two tags
                if should_write:
                    out.write(line)
                # When I meet this line, start writing out   
                if line.find("#C imageFile")!=-1:
                    should_write = True
    

    This way only the current line is held in memory at any time.

    If the boundary lines should be included:

    with open("KMAP_2018_04_23_071018_fast_00001.txt","r") as KMAPspec:
        with open("output.txt","w") as out: 
            should_write = False
            for line in KMAPspec:
                # When I meet this line, start writing out   
                if line.find("#C imageFile")!=-1:
                    should_write = True
                # Write out only if between the two tags
                if should_write:
                    out.write(line)
                # When I meet this line, stop writing out
                if line.find("#S")!=-1:
                    should_write = False
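
    The two loops above differ only in the order of the checks, so they can be folded into one generator. The following is a minimal sketch, not part of the original answer: the names iter_between, slice_file and include_boundaries are hypothetical, and it assumes, as the loops above do, that a tag counts as found when it appears anywhere in the line.

```python
def iter_between(lines, start_tag, end_tag, include_boundaries=False):
    """Yield the lines lying between a start-tag line and the next end-tag line.

    `lines` can be any iterable of strings, for example an open file,
    so only one line is held in memory at a time.
    """
    should_write = False
    for line in lines:
        is_start = start_tag in line
        is_end = end_tag in line
        if include_boundaries:
            # Start first, stop last: both tag lines are emitted.
            if is_start:
                should_write = True
            if should_write:
                yield line
            if is_end:
                should_write = False
        else:
            # Stop first, start last: neither tag line is emitted.
            if is_end:
                should_write = False
            if should_write:
                yield line
            if is_start:
                should_write = True


def slice_file(src_path, dst_path, start_tag="#C imageFile", end_tag="#S",
               include_boundaries=False):
    """Copy only the tagged blocks of src_path into dst_path in one pass."""
    with open(src_path, "r") as src, open(dst_path, "w") as out:
        out.writelines(iter_between(src, start_tag, end_tag, include_boundaries))


if __name__ == "__main__":
    # Small in-memory sample standing in for the real file (made-up content).
    sample = ["header\n", "#C imageFile a\n", "1 2\n", "#S 2\n", "note\n"]
    print(list(iter_between(sample, "#C imageFile", "#S")))
```

    With the file from the question this would be called as slice_file("KMAP_2018_04_23_071018_fast_00001.txt", "output.txt"). As for itertools.islice: it fits when the line numbers are already known, whereas the tag-based loop avoids the extra pass needed to find them.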