I have a pretty large file (about 8 GB). I have read this post: How to read a large file line by line, and this one: Tokenizing large (>70MB) TXT file using Python NLTK. Concatenation & write data to stream errors.
But this still doesn't do the job: when I run my code, my PC gets stuck. Am I doing something wrong?
I want to get all the words into a list (i.e. tokenize them). Also, doesn't the code read each line and tokenize that line on its own? Might this prevent the tokenizer from tokenizing words properly, since some words (and sentences) do not end at the end of a line?
I considered splitting the file up into smaller files, but wouldn't that still consume my RAM? I only have 8 GB of RAM, and the list of words will probably be about as big (8 GB) as the initial txt file.
import os
import nltk

word_list = []
number = 0
with open(os.path.join(save_path, 'alldata.txt'), 'r', encoding="utf-8") as t:
    for line in t.readlines():
        word_list += nltk.word_tokenize(line)
        number = number + 1
        print(number)
By using the following line:
for line in t.readlines():
# do the things
you are forcing Python to read the whole file with t.readlines(), which returns a list of strings representing the entire file, thus bringing the whole file into memory.
Instead, if you do as the example you linked states:
for line in t:
# do the things
Python will process the file line by line, which is what you want: the file object acts like a generator, yielding each line one at a time.
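As a quick, hedged illustration of that lazy behaviour (the filename here is just a placeholder for your own file), you can pull only the first few lines of an arbitrarily large file without ever reading the rest:

from itertools import islice

# Iterating a file object is lazy: islice pulls only the first 3 lines,
# so the rest of the (possibly huge) file is never read into memory.
with open('alldata.txt', 'r', encoding='utf-8') as t:
    for line in islice(t, 3):
        print(line.rstrip())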
Looking at your code again, I see that you are constantly appending to the word list with word_list += nltk.word_tokenize(line). This means that even if you do read the file one line at a time, you still keep all of that data in memory after the reader has moved past it. You will need a better way of doing whatever this is, because otherwise you will keep consuming massive amounts of memory: the tokenized data is never dropped.
For data this large, you will have to either find a way to store an interim version of your tokenized data (for example, by streaming the tokens to disk as you go; see the sketch just below), or design your code so that it only ever handles one tokenized word, or a small batch of them, at a time.
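For the first option, here is a minimal sketch (the interim filename tokens.txt is just an assumption, and save_path/alldata.txt come from your own code) that writes each line's tokens straight to disk instead of accumulating them in a list:

import os
import nltk

# Stream tokens to an interim file, one line of tokens per input line,
# so memory usage stays roughly constant regardless of the input size.
with open(os.path.join(save_path, 'alldata.txt'), 'r', encoding='utf-8') as src, \
        open(os.path.join(save_path, 'tokens.txt'), 'w', encoding='utf-8') as out:
    for line in src:
        out.write(' '.join(nltk.word_tokenize(line)) + '\n')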
For the second option, something like this might do the trick:
def enumerated_tokens(filepath):
    index = 0
    with open(filepath, 'r', encoding="utf-8") as t:
        for line in t:
            # Yield one (index, word) pair at a time instead of building a list.
            for word in nltk.word_tokenize(line):
                yield (index, word)
                index += 1
for index, word in enumerated_tokens(os.path.join(save_path, 'alldata.txt')):
    print(index, word)
    # Do the thing with your word.
Notice how this never actually stores the words anywhere. That doesn't mean you can't temporarily store anything, but if you're memory-constrained, generators are the way to go. This approach will likely be faster, more stable, and use less memory overall.
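For example, if what you ultimately need is word frequencies rather than the raw list of tokens (just a guess at your use case, not something your question states), you can feed the generator into a collections.Counter and never materialize more than one token at a time:

import os
from collections import Counter

# Reuses enumerated_tokens() and save_path defined above.
# Only one (index, word) pair exists at a time; the Counter holds just the
# distinct words and their counts, not every token from the 8 GB file.
counts = Counter(word for _, word in
                 enumerated_tokens(os.path.join(save_path, 'alldata.txt')))
print(counts.most_common(10))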