Tags: python, bert-language-model, huggingface-transformers

How to build a dataset for language modeling with the datasets library as with the old TextDataset from the transformers library


I am trying to load a custom dataset that I will then use for language modeling. The dataset is a text file with a whole document on each line, which means that every line exceeds the usual 512-token limit of most tokenizers.

I would like to understand how to build a text dataset that first splits the documents into chunks of a "tokenizable" size and then tokenizes them, the way the old TextDataset class did. With TextDataset you only had to do the following, and you got a tokenized dataset, with no text lost, ready to pass to a DataCollator:

model_checkpoint = 'distilbert-base-uncased'

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

from transformers import TextDataset

dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="path/to/text_file.txt",
    block_size=512,
)
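
For context, the tokenized dataset produced this way could then be passed straight to a data collator, for example as in the sketch below (the mlm_probability value is just an illustrative choice, not something prescribed by TextDataset):

from transformers import DataCollatorForLanguageModeling

# Builds masked-LM batches from the fixed-size blocks produced above
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # illustrative masking probability
)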

Since that approach is going to be deprecated soon, I would like to use the datasets library instead. What I have so far is the following, which of course fails, because each line is longer than the tokenizer's maximum sequence length:

import datasets
from transformers import AutoTokenizer

# Load the raw text file as a dataset with one example per line
dataset = datasets.load_dataset('text', data_files='path/to/text_file.txt')

model_checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

So what is the "standard" way of creating a dataset with the datasets library that behaves like the old TextDataset did?

Thank you very much for the help :))


Solution

  • I received an answer to this question from @lhoestq on the HuggingFace Datasets forum:

    Hi !

    If you want to tokenize line by line, you can use this:

    max_seq_length = 512
    num_proc = 4
    
    def tokenize_function(examples):
        # Remove empty lines
        examples["text"] = [line for line in examples["text"] if len(line) > 0 and not line.isspace()]
        return tokenizer(
            examples["text"],
            truncation=True,
            max_length=max_seq_length,
        )
    
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        num_proc=num_proc,
        remove_columns=["text"],
    )
    

    Note that TextDataset did something different, though: it concatenated all the texts and built blocks of size 512. If you need that behavior, you must apply an additional map function after the tokenization:

    # Main data processing function that will concatenate all texts from
    # our dataset and generate chunks of max_seq_length.
    def group_texts(examples):
        # Concatenate all texts.
        concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        # We drop the small remainder, we could add padding if the model supported it instead of this drop,
        # you can customize this part to your needs.
        total_length = (total_length // max_seq_length) * max_seq_length
        # Split by chunks of max_len.
        result = {
            k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
            for k, t in concatenated_examples.items()
        }
        return result
    
    # Note that with `batched=True`, this map processes 1,000 texts together,
    # so group_texts throws away a remainder for each of those groups of 1,000 texts.
    # You can adjust that batch_size here but a higher value might be slower to preprocess.
    
    tokenized_dataset = tokenized_dataset.map(
        group_texts,
        batched=True,
        num_proc=num_proc,
    )
    

    This code comes from the data preprocessing in the run_mlm.py example script of transformers.
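
For completeness, here is a minimal end-to-end sketch that chains the two map steps above to reproduce the old TextDataset behavior. The file path, the use of DataCollatorForLanguageModeling, and its mlm_probability are illustrative assumptions on my part, not part of the original answer:

import datasets
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

model_checkpoint = 'distilbert-base-uncased'
max_seq_length = 512
num_proc = 4

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# One example per line of the raw text file
dataset = datasets.load_dataset('text', data_files='path/to/text_file.txt')

def tokenize_function(examples):
    # Drop empty lines; do NOT truncate, so group_texts below can rebuild
    # fixed-size blocks without losing any text
    examples["text"] = [line for line in examples["text"] if len(line) > 0 and not line.isspace()]
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all token lists and cut them into blocks of max_seq_length
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    total_length = (total_length // max_seq_length) * max_seq_length
    return {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated.items()
    }

lm_dataset = (
    dataset
    .map(tokenize_function, batched=True, num_proc=num_proc, remove_columns=["text"])
    .map(group_texts, batched=True, num_proc=num_proc)
)

# The grouped dataset can now be fed to a data collator, e.g. for masked LM
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

With this, lm_dataset plays the same role that the old TextDataset object did.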