Describe the bug
The model I am using: TrOCR (microsoft/trocr-large-handwritten).
The problem arises when using the following training script:
    import torch
    from PIL import Image
    from torch.utils.data import Dataset
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")

    class OCRDataset(Dataset):  # renamed so it does not shadow torch.utils.data.Dataset
        def __init__(self, root_dir, df, processor, max_target_length=128):
            self.root_dir = root_dir
            self.df = df
            self.processor = processor
            self.max_target_length = max_target_length

        def __len__(self):
            return len(self.df)

        def __getitem__(self, idx):
            # get file name + text
            file_name = self.df['file_name'][idx]
            text = self.df['text'][idx]
            # prepare image (i.e. resize + normalize)
            image = Image.open(self.root_dir + file_name).convert("RGB")
            pixel_values = self.processor(image, return_tensors="pt").pixel_values
            # add labels (input_ids) by encoding the text
            labels = self.processor.tokenizer(text,
                                              padding="max_length",
                                              max_length=self.max_target_length).input_ids
            # important: make sure that PAD tokens are ignored by the loss function
            labels = [label if label != self.processor.tokenizer.pad_token_id else -100 for label in labels]
            return {"pixel_values": pixel_values.squeeze(), "labels": torch.tensor(labels)}

    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")
    model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
    model.config.pad_token_id = processor.tokenizer.pad_token_id
    model.config.vocab_size = model.config.decoder.vocab_size
    model.config.eos_token_id = processor.tokenizer.sep_token_id
# python3 train.py path/to/labels path/to/images/
To Reproduce
During training, the sequences generated by the model and shown in compute_metrics start with a duplicated start token <s><s>, i.e. ids [0, 0, ..., 2, 1, 1, 1]:
Input predictions: [[0,0,506,4422,8046,2,1,1,1,1,1]]
Input references: [[0,597,2747 ...,1,1,1]]
The same duplicated start token appears during testing.
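For reference, a minimal sketch mapping the prediction ids above back to tokens, assuming the RoBERTa-style special-token ids that the TrOCR tokenizer uses (0 = <s>, 1 = <pad>, 2 = </s>), which matches the <s><s> shown above:

```python
# Map the special-token ids in the reported prediction back to tokens.
special = {0: "<s>", 1: "<pad>", 2: "</s>"}
prediction = [0, 0, 506, 4422, 8046, 2, 1, 1, 1, 1, 1]
rendered = [special.get(i, str(i)) for i in prediction]
print(rendered[:2])  # ['<s>', '<s>'] -> the duplicated start token
```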
Expected behavior
In both reproduced problems:
During training I expect predictions with a single start token, i.e. Input predictions: [[0,506,4422,8046,2,1,1,1,1,1]]
During testing I expect generated text without the doubled start token:
tensor([[0,11867,405,22379,1277,..........,368,2]])
<s>ennyit erről, tőlem fényképezz amennyit akarsz, a véleményem akkor</s>
The problem comes from the passed token ids: the tokenizer already prepends a start token to the labels, and the TrOCR model prepends its own decoder_start_token_id as well, so the start token is duplicated. The solution is simple: skip the start token coming from the tokenizer with labels = labels[1:].
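The duplication can be illustrated with plain lists (a sketch with hypothetical ids; 0 plays the role of both the tokenizer's <s> and the model's decoder_start_token_id, as in the config above):

```python
# The tokenizer prepends <s> (id 0) to the labels, and the model independently
# prepends decoder_start_token_id (also 0 here), so the effective decoder-side
# sequence starts with two start tokens.
tokenizer_labels = [0, 597, 2747, 2]   # [<s>, ..., </s>] as returned by the tokenizer
decoder_start_token_id = 0             # set on model.config above

# Effective sequence when the tokenizer's start token is kept in the labels:
buggy = [decoder_start_token_id] + tokenizer_labels
print(buggy)   # [0, 0, 597, 2747, 2] -> doubled start token

# Effective sequence after dropping it with labels[1:]:
fixed = [decoder_start_token_id] + tokenizer_labels[1:]
print(fixed)   # [0, 597, 2747, 2]
```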
    def __getitem__(self, idx):
        # get file name + text
        file_name = self.df['file_name'][idx]
        text = self.df['text'][idx]
        # prepare image (i.e. resize + normalize)
        image = Image.open(self.root_dir + file_name).convert("RGB")
        pixel_values = self.processor(image, return_tensors="pt").pixel_values
        # add labels (input_ids) by encoding the text
        labels = self.processor.tokenizer(text,
                                          padding="max_length",
                                          max_length=self.max_target_length).input_ids
        # important: make sure that PAD tokens are ignored by the loss function
        labels = [label if label != self.processor.tokenizer.pad_token_id else -100 for label in labels]
        # skip the start token added by the tokenizer; the model adds its own
        labels = labels[1:]
        return {"pixel_values": pixel_values.squeeze(), "labels": torch.tensor(labels)}
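The label post-processing alone can be checked without the model or any image files; a minimal sketch of the two steps above, using hypothetical ids (0 = <s>, 1 = <pad>, 2 = </s>, pad_token_id = 1):

```python
# Reproduce the label post-processing from __getitem__ with hypothetical ids.
pad_token_id = 1
input_ids = [0, 597, 2747, 2, 1, 1, 1, 1]  # tokenizer output, padded to max_length

# mask PAD tokens so the loss function ignores them
labels = [t if t != pad_token_id else -100 for t in input_ids]
# drop the tokenizer's start token to avoid the duplicated <s>
labels = labels[1:]

print(labels)  # [597, 2747, 2, -100, -100, -100, -100]
```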