
How do I remove lines from a big file in Python in a resource-limited environment?


Say I have an Ubuntu VPS in the USA with a 10GB HDD (and I live somewhere else), and there is a 9GB text file on that drive. The VPS has 512MB of RAM and about the same amount of swap.

Given that I cannot add more disk space and cannot move the file elsewhere for processing, is there an efficient way to remove some lines from the file using Python (preferably, but any other language is acceptable)?


Solution

  • How about this? It edits the file in place. I've tested it on some small text files (in Python 2.6.1), but I'm not sure how well it will perform on massive files, because of all the seeking back and forth between the read and write positions.

    I've used an infinite while loop with a manual EOF check, because for line in f: didn't work correctly (iterating over a file uses a read-ahead buffer, so it doesn't mix well with f.tell() and f.seek()). There may be a better way to check this, but I'm relatively new to Python, so someone please let me know if there is.

    Also, you'll need to define the function isRequired(line), which should return True for every line you want to keep.

    writeLoc = 0
    readLoc = 0
    #binary mode, so tell()/seek() are plain byte offsets and
    #isRequired() receives each line as bytes
    with open( "filename" , "r+b" ) as f:
        while True:
            line = f.readline()
    
            #readline() returns an empty result only at EOF
            #(a blank line still contains its newline)
            if not line:
                break
    
            #save how far we've read
            readLoc = f.tell()
    
            #if we need this line, write it at the write
            #location and then update that location
            if isRequired(line):
                f.seek( writeLoc )
                f.write( line )
                writeLoc = f.tell()
                f.seek( readLoc )
    
        #finally, chop off the rest of the file that's no longer needed
        f.truncate( writeLoc )
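
For anyone who wants to try the idea on a throwaway file first, here is a minimal self-contained sketch of the same approach. The names remove_lines_in_place and is_required, the temporary demo file, and the "drop lines starting with #" rule are all made up for illustration; they are not part of the answer above. It assumes binary mode (so offsets are byte counts) and skips the write when a kept line is already in its final position. Note that nothing here is crash-safe: if the process is interrupted partway through, the file is left partly rewritten.

```python
import os
import tempfile


def remove_lines_in_place(path, is_required):
    """Keep only lines for which is_required(line) is true, editing path in place."""
    write_pos = 0
    with open(path, "r+b") as f:
        while True:
            line = f.readline()
            if not line:  # empty bytes means EOF
                break
            read_pos = f.tell()  # where the next unread line starts
            if is_required(line):
                # Only rewrite the line if an earlier line was dropped;
                # otherwise it already sits at write_pos.
                if write_pos != read_pos - len(line):
                    f.seek(write_pos)
                    f.write(line)
                    f.seek(read_pos)
                write_pos += len(line)
        # Cut off the leftover tail after the last kept line.
        f.truncate(write_pos)


def is_required(line):
    # Hypothetical rule for the demo: drop lines starting with "#".
    return not line.startswith(b"#")


# Build a small demo file, filter it, and read the result back.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"keep 1\n# drop me\nkeep 2\n# drop me too\nkeep 3\n")

remove_lines_in_place(demo_path, is_required)

with open(demo_path, "rb") as f:
    result = f.read()
os.remove(demo_path)
print(result)
```

Because the write position can never get ahead of the read position, a kept line is only ever copied over bytes that have already been read, so no extra disk space is needed beyond the file itself.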