python, file, large-files

How do I properly read large text files in Python so I don't clog up memory?


So today, while buying BTC, I messed up and lost the decryption passphrase for the wallet that the ATM automatically sends by email.

I remember the last 4 characters of the passphrase, so I generated a wordlist and tried to brute-force my way in. It was a 4 MB file, and the script checked all the possibilities with no luck. Then I realized that maybe the letters are wrong, but I still remember which numbers were in those 4 characters. Well, suddenly I have a 2 GB file and the script gets SIGKILLed by Ubuntu.

Here is the whole code, it is very short.

#!/usr/bin/python

from zipfile import ZipFile
import sys
i = 0
found = False

with ZipFile("/home/kuskus/Desktop/wallet.zip") as zf:
    with open('/home/kuskus/Desktop/wl.txt') as wordlist:
        for line in wordlist.readlines():
            if(not found):
                try:
                    zf.extractall(pwd = str.encode(line))
                    print("password found: %s" % line)
                    found = True
                except:
                    print(i)
                    i += 1
            else: sys.exit()

I think the issue is that the text file fills up the memory, so the OS kills the process. I really don't know how I could read the file; maybe read 1000 lines, clear them, and then read another 1000 lines. If anyone could help me, I would be very grateful. Thank you in advance :) Oh, and the text file has about 300 million lines, if it matters.


Solution

  • Usually the best thing to do is to iterate over the file directly. The file handle is an iterator that produces lines one at a time, rather than loading them all into memory at once as a list (as fh.readlines() does):

    with open("somefile") as fh:
        for line in fh:
            pass  # do something with line
    

    Furthermore, file handles allow you to read specific amounts of data if you so choose:

    with open("somefile") as fh:
        chunk = fh.read(15)  # reads up to 15 characters in text mode (15 bytes in binary mode)
        while chunk:
            # do something with chunk
            chunk = fh.read(15)
    

    Or, if you want to read a specific number of lines:

    with open('somefile') as fh:
        while True:
            # read up to 5 lines at a time; readline() returns '' at end of file, so drop those
            chunk_of_lines = [line for line in (fh.readline() for _ in range(5)) if line]
            if not chunk_of_lines:
                break
            # do something with chunk_of_lines here
    

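    Here, fh.readline() is analogous to calling next(fh), which is what a for loop does behind the scenes, except that at the end of the file fh.readline() returns an empty string while next(fh) raises StopIteration.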

    A while loop is used in the latter two examples because, once the file has been read completely, fh.readline() and fh.read(some_integer) return an empty string, which evaluates as False and is the signal to terminate the loop.
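
Applied to the script in the question, direct iteration could look roughly like the sketch below. It is not a drop-in replacement. The names iter_candidates, find_password and make_zip_checker are made up for illustration, and the zip check assumes the archive uses the legacy zip encryption that the standard zipfile module can decrypt. Unlike the original, it strips the trailing newline from each line before trying it as a password, and it reads one member instead of extracting the whole archive on every attempt.

```python
import zipfile
import zlib


def iter_candidates(wordlist_path):
    """Yield one candidate per line, lazily, without the trailing newline."""
    with open(wordlist_path, "rb") as wordlist:
        for raw_line in wordlist:  # the file object hands out one line at a time
            candidate = raw_line.rstrip(b"\r\n")
            if candidate:  # skip blank lines
                yield candidate


def find_password(wordlist_path, is_correct):
    """Return the first candidate accepted by is_correct, or None if none match."""
    for attempt, candidate in enumerate(iter_candidates(wordlist_path), start=1):
        if is_correct(candidate):
            return candidate
        if attempt % 1_000_000 == 0:
            print("tried", attempt, "candidates")  # occasional progress report
    return None


def make_zip_checker(zip_path):
    """Hypothetical helper: build an is_correct callable for an encrypted zip."""
    archive = zipfile.ZipFile(zip_path)
    first_member = archive.namelist()[0]  # assumes the first entry is an encrypted file

    def is_correct(candidate):
        try:
            with archive.open(first_member, pwd=candidate) as member:
                member.read()  # reading to the end also runs the CRC check
            return True
        except (RuntimeError, zipfile.BadZipFile, zlib.error):
            # a wrong password usually shows up as one of these errors
            return False

    return is_correct


# Usage with the paths from the question (assumed to exist on your machine):
# checker = make_zip_checker("/home/kuskus/Desktop/wallet.zip")
# print(find_password("/home/kuskus/Desktop/wl.txt", checker))
```

Because only one line is held in memory at a time, the size of the wordlist no longer matters for memory use. Printing every attempt, as the original does, also slows things down considerably with 300 million lines, which is why the sketch reports progress only once per million candidates.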