I want to train Hugging Face's Donut (Document Understanding Transformer) but I need help in creating the training data.
Donut github: https://github.com/clovaai/donut
Donut official documentation: https://huggingface.co/docs/transformers/main/en/model_doc/donut
If anybody has already created and trained the model, kindly help.
Understanding how to label the training data was a bit confusing at first, but it became clear after reading the documentation.
Donut treats every task as a JSON prediction problem. Ensure that your dataset follows this structure:
dataset_name
├── test
│   ├── metadata.jsonl
│   ├── {image_path0}
│   ├── {image_path1}
│   .
│   .
├── train
│   ├── metadata.jsonl
│   ├── {image_path0}
│   ├── {image_path1}
│   .
│   .
└── validation
    ├── metadata.jsonl
    ├── {image_path0}
    ├── {image_path1}
    .
    .
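The layout above can be created up front with a few lines of standard-library Python (the directory name `dataset_name` is just the placeholder from the tree; use your own):

```python
import os

# Create one folder per split, each with an (initially empty) metadata.jsonl.
# Images go into the same folder as their split's metadata.jsonl.
for split in ("train", "validation", "test"):
    split_dir = os.path.join("dataset_name", split)
    os.makedirs(split_dir, exist_ok=True)
    # Touch the metadata file so the split is valid even before labeling
    open(os.path.join(split_dir, "metadata.jsonl"), "a").close()
```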
For Document Information Extraction, each line in metadata.jsonl should look like this:
{"file_name": {image_path0}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}
{"file_name": {image_path1}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}
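The common gotcha in this template is that `ground_truth` is a JSON-*encoded string*, not a nested object, so it has to be serialized twice. A minimal sketch of writing one line (the file name and the keys inside `gt_parse` are made-up examples; use whatever fields you want extracted):

```python
import json
import os

# Hypothetical ground truth for one document image; Donut does not require
# any particular keys here -- these are illustrative only.
gt_parse = {"company": "ACME Corp", "date": "2023-01-15", "total": "12.50"}

entry = {
    "file_name": "image_0001.png",
    # json.dumps here makes ground_truth a string containing JSON,
    # matching the escaped \"gt_parse\" in the template above.
    "ground_truth": json.dumps({"gt_parse": gt_parse}),
}

os.makedirs("dataset_name/train", exist_ok=True)
with open("dataset_name/train/metadata.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry) + "\n")
```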
The model ignores any other metadata in each line and uses only the gt_parse (or gt_parses) field as the target JSON to predict.
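Internally, the gt_parse dictionary is linearized into a token sequence before training. A simplified sketch of that conversion, modeled on the `json2token` helper in the Donut repo (this version assumes string values and no lists, so treat it as illustrative, not the actual implementation):

```python
def json2token(obj):
    """Linearize a gt_parse dict into Donut-style <s_key>...</s_key> tokens.

    Simplified sketch: handles nested dicts and string leaves only.
    """
    if isinstance(obj, dict):
        return "".join(
            f"<s_{key}>" + json2token(value) + f"</s_{key}>"
            for key, value in obj.items()
        )
    return str(obj)

seq = json2token({"company": "ACME Corp", "total": "12.50"})
# seq == "<s_company>ACME Corp</s_company><s_total>12.50</s_total>"
```

Understanding this linearization also explains why clean, consistent key names in gt_parse matter: each distinct key becomes a pair of special tokens in the model's vocabulary.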
For other tasks like Document Visual Question Answering, refer to the GitHub link.