Tags: python, transformer-model

How to determine the block size when building a training dataset


I want to build a training dataset by applying a previously trained tokenizer to my text file. The size of my text file is 7.02 GB (7,543,648,706 bytes). This is what I have written:

from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="data.txt", block_size = ???
)

What does "block size" exactly mean here? How can I determine its value?


Solution

  • Most implementations of deep learning models cannot process sequential input of variable length (they can if the batch size is 1, but that is very inefficient and impractical), so they take inputs of a fixed length.

    For example, if an input batch of size 2 is:

    hello world
    my name is stack overflow
    

    they should be padded to a fixed maximum length (here, 10) like

    hello world 0  0     0        0 0 0 0 0
    my    name  is stack overflow 0 0 0 0 0
    

    Your dataset should provide batches of that fixed size, and block_size is the parameter that sets it. If an input line is longer than block_size, it will be truncated to that length. See the sketch below for a common way to choose the value.
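
    A minimal sketch of one common way to pick block_size, assuming a Hugging Face tokenizer loaded via AutoTokenizer (the checkpoint name "bert-base-uncased" is only a placeholder for your own trained tokenizer): cap it at the tokenizer's model_max_length, or use a smaller value such as 128 if most of your lines are short and you want to save memory.

    from transformers import AutoTokenizer, LineByLineTextDataset

    # Assumption: the tokenizer is loaded with AutoTokenizer; replace the
    # checkpoint name with the tokenizer you trained yourself.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # A practical choice: never exceed the model's maximum sequence length
    # (tokenizer.model_max_length, e.g. 512 for BERT-style models), and use
    # a smaller cap such as 128 to save memory if most lines are short.
    block_size = min(128, tokenizer.model_max_length)

    dataset = LineByLineTextDataset(
        tokenizer=tokenizer,
        file_path="data.txt",   # each non-empty line becomes one example
        block_size=block_size,  # lines longer than this are truncated
    )

    With this setting, every encoded line is at most block_size tokens; a data collator such as DataCollatorForLanguageModeling then pads shorter lines up to the longest example in each batch, which is the padding step described above.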