python memory-management large-files mergesort

Python - reading and deleting the top line of a file without loading it into memory


I need to merge sort text files which are about 150 MB each and together amount to about 5 GB.

The problem is that I can't do the merge sort using readlines(), since the last step would need to load 5 GB into memory, and with only the

for line1 in file1, line2 in file2:
    while( line1 & line2 )...

command, I can't tell Python to get only the next line of file 1 while keeping the current line of file 2, and so I am unable to write a merge sort.

I read something about setting the read buffer really low on readlines() so that only a single line is loaded into memory, but then I can't delete the first line from the file.

Is there any other memory-efficient way to get only the first line of a file and delete it, or is there already a function available somewhere to merge sort two text files?


Solution

  • command, I can't tell Python to get only the next line of file 1 while keeping the current line of file 2, and so I am unable to write a merge sort.

    No, you can: readline() reads one line per call, so each file can be advanced independently of the other.

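    line1 = file1.readline()
    line2 = file2.readline()
    # readline() returns '' at end of file, so loop while both files still have a line
    while line1 and line2:
        if line1 < line2:
            file3.write(line1)
            line1 = file1.readline()
        else:
            file3.write(line2)
            line2 = file2.readline()

    # one file is exhausted; copy whatever is left of the other into file 3
    while line1:
        file3.write(line1)
        line1 = file1.readline()
    while line2:
        file3.write(line2)
        line2 = file2.readline()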
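
As for the question of whether a ready-made function exists: the standard library's heapq.merge lazily merges any number of already-sorted iterables, and open file objects iterate line by line. The following is a minimal sketch rather than a definitive implementation. The function name merge_sorted_files and its parameters are made up for illustration, and it assumes that every input file is already sorted and that every line, including the last, ends with a newline.

```python
import heapq
from contextlib import ExitStack


def merge_sorted_files(in_paths, out_path):
    """Merge already-sorted text files into out_path, line by line.

    Hypothetical helper: assumes each input is sorted and that every
    line (including the last one) ends with a newline.
    """
    with ExitStack() as stack:
        # Open every input; ExitStack closes them all when the block exits.
        sources = [stack.enter_context(open(p, "r")) for p in in_paths]
        out = stack.enter_context(open(out_path, "w"))
        # heapq.merge pulls lines from the file iterators on demand, so it
        # holds roughly one pending line per input (plus normal I/O buffers).
        out.writelines(heapq.merge(*sources))
```

It could be called as merge_sorted_files(["part1.txt", "part2.txt"], "merged.txt"), where the file names are placeholders. Because it accepts any number of inputs, all of the sorted 150 MB chunks could in principle be merged in one pass, subject to the operating system's limit on open files.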