Tags: python, deep-learning, pytorch, huggingface-transformers, huggingface

How to avoid adding a double start token when fine-tuning a TrOCR model


Describe the bug

The model I am using: TrOCR (microsoft/trocr-large-handwritten).

The problem arises when using:

  • [x] the official example scripts: the fine-tuning tutorial (fine_tune) by @NielsRogge
  • [x] my own modified scripts (see the script below)

from PIL import Image
import torch
from torch.utils.data import Dataset
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")

class Dataset(Dataset):
    def __init__(self, root_dir, df, processor, max_target_length=128):
        self.root_dir = root_dir
        self.df = df
        self.processor = processor
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # get file name + text 
        file_name = self.df['file_name'][idx]
        text = self.df['text'][idx]
        # prepare image (i.e. resize + normalize)
        image = Image.open(self.root_dir + file_name).convert("RGB")
        pixel_values = self.processor(image, return_tensors="pt").pixel_values
        # add labels (input_ids) by encoding the text
        labels = self.processor.tokenizer(
            text, padding="max_length", max_length=self.max_target_length
        ).input_ids
        # important: make sure that PAD tokens are ignored by the loss function
        labels = [label if label != self.processor.tokenizer.pad_token_id else -100 for label in labels]
        # return the processed image and the encoded labels
        return {"pixel_values": pixel_values.squeeze(), "labels": torch.tensor(labels)}

model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")
# set special tokens used for creating the decoder_input_ids from the labels
model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.eos_token_id = processor.tokenizer.sep_token_id
# make sure the vocab size is set correctly
model.config.vocab_size = model.config.decoder.vocab_size


# python3 train.py path/to/labels  path/to/images/
Environment:

  • Platform: Ubuntu Linux (GCC 9.4.0)
  • PyTorch version (GPU?): 0.8.2+cu110
  • transformers version: 4.22.2
  • Python version: 3.8.10

To Reproduce

Steps to reproduce the behavior:

  1. After training, or during training when evaluation metrics are computed, the model adds a double start token <s><s>, i.e. ids [0, 0, ..., 2, 1, 1, 1] (see the quick check after this list).
  2. Example from the training phase, showing the generated tokens in compute_metrics: Input predictions: [[0,0,506,4422,8046,2,1,1,1,1,1]] Input references: [[0,597,2747 ...,1,1,1]]
  3. The same duplication appears when testing the trained model.
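
A quick way to see where the extra <s> comes from is to compare what the tokenizer prepends to the labels with the decoder_start_token_id configured in the script above (a minimal check, reusing the processor loaded earlier; the sample string is arbitrary):

    ids = processor.tokenizer("some text").input_ids
    print(ids[0])                            # 0 -> the tokenizer already prepends <s> to every label
    print(processor.tokenizer.cls_token_id)  # 0 -> the same id set above as decoder_start_token_id

Since VisionEncoderDecoderModel builds the decoder inputs by prepending decoder_start_token_id to the shifted labels, training on labels that also begin with <s> teaches the model to emit a second <s> right after the forced start token.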

Expected behavior

For the two cases above, during training I expect Input predictions: [[0,506,4422,8046,2,1,1,1,1,1]]

During the testing phase, I expect generated ids without the duplicated start token: tensor([[0,11867,405,22379,1277,..........,368,2]])

<s>ennyit erről, tőlem fényképezz amennyit akarsz, a véleményem akkor</s>
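
For reference, this is roughly how the generated ids above map to the decoded string (a sketch; generated_ids stands for the output of model.generate(pixel_values), which is not shown in the post):

    # `generated_ids` is assumed to be the tensor returned by model.generate(pixel_values)
    print(processor.batch_decode(generated_ids, skip_special_tokens=False)[0])  # keeps <s> ... </s>
    print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])   # plain text only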


Solution

  • The problem comes from the passed token ids: the tokenizer prepends its own start token to the labels, and the TrOCR model adds another start token (decoder_start_token_id), so the duplication happens. The fix is simply to skip the start token coming from the tokenizer with labels = labels[1:]:

    def __getitem__(self, idx):
        # get file name + text
        file_name = self.df['file_name'][idx]
        text = self.df['text'][idx]
        # prepare image (i.e. resize + normalize)
        image = Image.open(self.root_dir + file_name).convert("RGB")
        pixel_values = self.processor(image, return_tensors="pt").pixel_values
        # add labels (input_ids) by encoding the text
        labels = self.processor.tokenizer(
            text, padding="max_length", max_length=self.max_target_length
        ).input_ids
        # important: make sure that PAD tokens are ignored by the loss function
        labels = [label if label != self.processor.tokenizer.pad_token_id else -100 for label in labels]
        # drop the start token added by the tokenizer; the model prepends
        # decoder_start_token_id itself, which is what caused the duplication
        labels = labels[1:]
        return {"pixel_values": pixel_values.squeeze(), "labels": torch.tensor(labels)}