Tags: python, nlp, pytorch, huggingface-transformers, huggingface-tokenizers

Tokenizing & encoding dataset uses too much RAM


Trying to tokenize and encode data to feed to a neural network.

I only have 25 GB of RAM, and every time I try to run the code below my Google Colab session crashes with “Your session crashed after using all available RAM”. Any idea how to prevent this from happening?

I thought tokenizing/encoding chunks of 50,000 sentences would work, but unfortunately it does not. The code works on a dataset of length 1.3 million, but the current dataset has a length of 5 million.

max_q_len = 128
max_a_len = 64
trainq_list = train_q.tolist()
batch_size = 50000

def batch_encode(text, max_seq_len):
    for i in range(0, len(trainq_list), batch_size):
        encoded_sent = tokenizer.batch_encode_plus(
            text,
            max_length=max_seq_len,
            pad_to_max_length=True,
            truncation=True,
            return_token_type_ids=False
        )
    return encoded_sent

# tokenize and encode sequences in the training set
tokensq_train = batch_encode(trainq_list, max_q_len)

The tokenizer comes from HuggingFace:

tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')

Solution

  • You should use generators and pass the data to tokenizer.batch_encode_plus chunk by chunk, no matter the dataset size.

    Conceptually, it looks something like this (an end-to-end sketch tying the pieces together follows at the end of this answer):

    Training list

    This one probably holds a list of sentences read from some file(s). If this is a single large file, you could follow this answer to lazily read parts of the input (preferably batch_size lines at a time):

    def read_in_chunks(file_object, chunk_size=1024):
        """Lazy function (generator) to read a file piece by piece.
        Default chunk size: 1k."""
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data
    

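    Note that batch_encode_plus expects a list of sentences rather than raw text, so if the file has one sentence per line, a line-based reader may be more convenient. A minimal sketch, assuming a hypothetical read_lines_in_chunks helper and 50,000 lines per chunk:

    def read_lines_in_chunks(path, batch_size=50000):
        """Lazily yield lists of `batch_size` lines, ready for batch_encode_plus."""
        batch = []
        with open(path, "r") as f:
            for line in f:
                batch.append(line.rstrip("\n"))
                if len(batch) == batch_size:
                    yield batch
                    batch = []
        if batch:  # remaining lines at the end of the file
            yield batch
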
    If the data is instead split across multiple smaller files (each much smaller than memory, because it will be far larger after encoding with BERT), open them one at a time, something like this:

    import pathlib
    
    
    def read_in_chunks(directory: pathlib.Path):
        # Use "*.txt" or any other extension your file might have
        for file in directory.glob("*"):
            with open(file, "r") as f:
                yield f.readlines()
    
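    Each yielded item is then a list of lines from one file, which can be passed straight to the encoder described next, e.g. (the directory name is hypothetical):

    for lines in read_in_chunks(pathlib.Path("train_questions/")):
        print(len(lines))  # number of sentences read from this file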

    Encoding

    The encoder should take this generator and yield back encoded chunks, something like this:

    # Generator should create lists useful for encoding
    def batch_encode(generator, max_seq_len):
        tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")
        for text in generator:
            yield tokenizer.batch_encode_plus(
                text,
                max_length=max_seq_len,
                pad_to_max_length=True,
                truncation=True,
                return_token_type_ids=False,
            )
    
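    Each yielded object is a dict-like BatchEncoding holding input_ids and attention_mask for one chunk. A quick sanity check might look like this (the reader and the sequence length of 128 are placeholders):

    encodings = batch_encode(read_in_chunks(pathlib.Path("train_questions/")), max_seq_len=128)
    first = next(encodings)
    print(len(first["input_ids"]))     # number of sentences in the first chunk
    print(len(first["input_ids"][0]))  # padded sequence length, i.e. max_seq_len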

    Saving encoded files

    As the encoded data will be too large to fit in RAM, you should save it to disk (or consume it in some other way as it is generated).

    Something along those lines:

    import numpy as np
    
    
    # I assume np.arrays are created, adjust to PyTorch Tensors or anything if needed
    def save(encoding_generator):
        for i, encoded in enumerate(encoding_generator):
            np.save(str(i), encoded)
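
    Putting it all together, a minimal end-to-end sketch using the functions above (the directory name is hypothetical; max_q_len = 128 is reused from the question):

    import pathlib

    max_q_len = 128

    chunks = read_in_chunks(pathlib.Path("train_questions/"))  # lists of sentences, one file at a time
    encodings = batch_encode(chunks, max_q_len)                # encoded lazily, chunk by chunk
    save(encodings)                                            # written to 0.npy, 1.npy, ...

    Since each saved object is dict-like rather than a plain array, loading it back with np.load will likely require allow_pickle=True; saving input_ids and attention_mask as separate arrays (for example with np.savez) is an alternative if you prefer plain arrays.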