I want to build a training dataset by applying a previously trained tokenizer to my text file. The size of my text file is 7.02 GB (7,543,648,706 bytes). This is what I have written:
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="data.txt",
    block_size=???,
)
What exactly does block_size mean here, and how can I determine its value?
Most implementations of deep learning models cannot process sequential inputs of variable length (they can if the batch size is 1, but that is very inefficient and impractical), so they take inputs of a fixed length.
For example, if an input batch of size 2 is:
hello world
my name is stack overflow
both sequences should be padded to the same maximum length (e.g., 10), like
hello world 0 0 0 0 0 0 0 0
my name is stack overflow 0 0 0 0 0
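For instance, here is a minimal sketch of that padding step with the Hugging Face tokenizer API (bert-base-uncased is just a stand-in for whatever tokenizer you trained, and the exact token counts will differ by vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in for your own tokenizer

batch = ["hello world", "my name is stack overflow"]

# Pad every sequence to exactly 10 tokens and truncate anything longer.
encoded = tokenizer(
    batch,
    padding="max_length",
    truncation=True,
    max_length=10,
    return_tensors="pt",
)

print(encoded["input_ids"].shape)   # torch.Size([2, 10]) -- every row has the same length
print(encoded["attention_mask"])    # 1 for real tokens, 0 for the padding positions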
Your Dataset should provide examples of a fixed size, and block_size serves that purpose: it is the maximum number of tokens per example. With LineByLineTextDataset, each line of the file becomes one example, and any line longer than block_size tokens is truncated to block_size.
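As for how to pick a value: a common choice (assuming a BERT-style model, which is an assumption on my part) is the maximum sequence length the model accepts, which tokenizer.model_max_length usually reports (512 for BERT). A sketch under that assumption:

from transformers import AutoTokenizer, LineByLineTextDataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in for your trained tokenizer

# Cap at 512 in case model_max_length is an unset sentinel value.
block_size = min(tokenizer.model_max_length, 512)

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="data.txt",
    block_size=block_size,
)

# Each non-empty line of data.txt becomes one example,
# truncated to at most block_size tokens.
print(len(dataset))
print(dataset[0])  # token ids for the first line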