
Huggingface Dataset.map shows red progress bar when batched=True


I have the following simple code copied from Huggingface examples:

from datasets import load_dataset
from transformers import AutoTokenizer

model_checkpoint = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

def tokenize_function(examples):
    # Tokenize the "text" column (a single example, or a batch when batched=True)
    return tokenizer(examples["text"])

datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
# Changing batched=False to batched=True here produces the red progress bar
tokenized_datasets = datasets.map(tokenize_function, batched=False, num_proc=4, remove_columns=["text"])

With batched=False the progress bar turns green, which indicates success. With batched=True the progress bar turns red and stops short of 100%. Does that mean my map function failed, or is something else going on?


Solution

  • It is likely a bug in the progress-bar printing logic, not in the processing itself. There is relevant discussion on discuss.huggingface.co and on GitHub.
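
If you want to confirm that the processing really finished instead of trusting the bar color, one option is to compare the mapped result against the original. The following is a minimal sketch, not part of the datasets library: map_completed is a hypothetical helper name, and the expected column names (input_ids, attention_mask) are an assumption based on what the distilgpt2 tokenizer normally returns. It also assumes the map function produces one output row per input row, which holds for the tokenize_function in the question but not for batched functions that split or merge rows.

```python
def map_completed(original, mapped, expected_columns=("input_ids", "attention_mask")):
    """Return True if every split kept its row count and has the expected columns.

    `original` and `mapped` are dict-like objects keyed by split name
    (for example a DatasetDict); each value must expose `num_rows` and
    `column_names`, as datasets.Dataset does.
    """
    for split in original:
        before, after = original[split], mapped[split]
        # A map that stopped early would leave fewer rows than the input
        if after.num_rows != before.num_rows:
            return False
        # The tokenizer output columns should be present after mapping
        if not set(expected_columns).issubset(after.column_names):
            return False
    return True
```

With the names from the question, the check would be map_completed(datasets, tokenized_datasets). If it returns True for the batched=True run, the red bar is cosmetic.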