I have the following simple code copied from Huggingface examples:
model_checkpoint = "distilgpt2"
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
def tokenize_function(examples):
    # Tokenize the "text" column (a single string, or a list of strings when batched=True)
    return tokenizer(examples["text"])
from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
tokenized_datasets = datasets.map(tokenize_function, batched=False, num_proc=4, remove_columns=["text"])
When I set batched=False, the progress bar turns green, which indicates success. But if I set batched=True, the progress bar turns red and does not reach 100%. Does that mean my map function failed, or does it mean something else?
This is likely a bug in the progress-bar printing logic, not in the processing itself. There is some relevant discussion of it at discuss.huggingface.co and on GitHub.
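If you want to confirm this without relying on the bar color, you can inspect the mapped output directly. The sketch below is one way to do that, not an official API: `verify_map_output` is a hypothetical helper name, and it assumes the tokenizer adds `input_ids` and `attention_mask` columns and that the map is one row in, one row out (as in the question). It only relies on the `num_rows` and `column_names` attributes that `Dataset` objects expose, so the small demo uses stand-in objects instead of downloading anything.

```python
from types import SimpleNamespace


def verify_map_output(original, mapped, expected_columns=("input_ids", "attention_mask")):
    """Return a list of problems; an empty list means every split looks complete.

    `original` and `mapped` are dict-like objects (for example a DatasetDict)
    whose values expose `num_rows` and `column_names`.
    """
    problems = []
    for split, before in original.items():
        after = mapped.get(split)
        if after is None:
            problems.append(f"{split}: missing from mapped output")
            continue
        # A one-to-one map must preserve the number of rows.
        if after.num_rows != before.num_rows:
            problems.append(f"{split}: {after.num_rows} rows, expected {before.num_rows}")
        # The tokenizer output columns should be present after mapping.
        missing = [c for c in expected_columns if c not in after.column_names]
        if missing:
            problems.append(f"{split}: missing columns {missing}")
    return problems


# Stand-in objects for illustration only; with real data you would call
# verify_map_output(datasets, tokenized_datasets) instead.
before = {"train": SimpleNamespace(num_rows=3, column_names=["text"])}
after = {"train": SimpleNamespace(num_rows=3, column_names=["input_ids", "attention_mask"])}
print(verify_map_output(before, after))
```

An empty list for the batched=True run would suggest that every row was processed and that only the progress display is off.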