I want to train Hugging Face's Donut (Document Understanding Transformer) but I need help in creating the training data.
Donut github: https://github.com/clovaai/donut
Donut official documentation: https://huggingface.co/docs/transformers/main/en/model_doc/donut
If anybody has already created and trained the model, kindly help.
Understanding how to label the training data was a bit confusing at first, but it became clear after reading the documentation.
Donut treats every task as a JSON prediction problem. Ensure that your dataset follows this structure:
dataset_name
├── test
│   ├── metadata.jsonl
│   ├── {image_path0}
│   ├── {image_path1}
│   .
│   .
├── train
│   ├── metadata.jsonl
│   ├── {image_path0}
│   ├── {image_path1}
│   .
│   .
└── validation
    ├── metadata.jsonl
    ├── {image_path0}
    ├── {image_path1}
    .
    .
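The layout above can be created up front with a few lines of standard-library Python (the directory name `dataset_name` is just the placeholder from the tree; use your own):

```python
import os

# Create one folder per split, each with an (initially empty) metadata.jsonl.
# Images go into the same folder as their split's metadata.jsonl.
for split in ("train", "validation", "test"):
    split_dir = os.path.join("dataset_name", split)
    os.makedirs(split_dir, exist_ok=True)
    # Touch the metadata file so the split is valid even before labeling
    open(os.path.join(split_dir, "metadata.jsonl"), "a").close()
```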
For Document Information Extraction, each line in metadata.jsonl should look like this:
{"file_name": {image_path0}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}
{"file_name": {image_path1}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}
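The common gotcha in this template is that `ground_truth` is a JSON-*encoded string*, not a nested object, so it has to be serialized twice. A minimal sketch of writing one line (the file name and the keys inside `gt_parse` are made-up examples; use whatever fields you want extracted):

```python
import json
import os

# Hypothetical ground truth for one document image; Donut does not require
# any particular keys here -- these are illustrative only.
gt_parse = {"company": "ACME Corp", "date": "2023-01-15", "total": "12.50"}

entry = {
    "file_name": "image_0001.png",
    # json.dumps here makes ground_truth a string containing JSON,
    # matching the escaped \"gt_parse\" in the template above.
    "ground_truth": json.dumps({"gt_parse": gt_parse}),
}

os.makedirs("dataset_name/train", exist_ok=True)
with open("dataset_name/train/metadata.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry) + "\n")
```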
The model ignores any other metadata in each line and uses only the gt_parse (or gt_parses) field as the target JSON to predict.
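Internally, the gt_parse dictionary is linearized into a token sequence before training. A simplified sketch of that conversion, modeled on the `json2token` helper in the Donut repo (this version assumes string values and no lists, so treat it as illustrative, not the actual implementation):

```python
def json2token(obj):
    """Linearize a gt_parse dict into Donut-style <s_key>...</s_key> tokens.

    Simplified sketch: handles nested dicts and string leaves only.
    """
    if isinstance(obj, dict):
        return "".join(
            f"<s_{key}>" + json2token(value) + f"</s_{key}>"
            for key, value in obj.items()
        )
    return str(obj)

seq = json2token({"company": "ACME Corp", "total": "12.50"})
# seq == "<s_company>ACME Corp</s_company><s_total>12.50</s_total>"
```

Understanding this linearization also explains why clean, consistent key names in gt_parse matter: each distinct key becomes a pair of special tokens in the model's vocabulary.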
For other tasks like Document Visual Question Answering, refer to the GitHub link.