Tags: python, huggingface-transformers

How to prepare custom training data for Donut (Document Understanding Transformer)?


I want to train Hugging Face's Donut (Document Understanding Transformer), but I need help creating the training data.

Donut github: https://github.com/clovaai/donut

Donut official documentation: https://huggingface.co/docs/transformers/main/en/model_doc/donut

If anybody has already created and trained the model, kindly help.


Solution

  • Understanding how to label the training data was a bit confusing in the beginning, but it becomes clear once you see the expected directory structure and annotation format.

    Donut treats all tasks as JSON prediction problems. Ensure that your dataset follows this structure:

    dataset_name
    ├── test
    │   ├── metadata.jsonl
    │   ├── {image_path0}
    │   ├── {image_path1}
    │             .
    │             .
    ├── train
    │   ├── metadata.jsonl
    │   ├── {image_path0}
    │   ├── {image_path1}
    │             .
    │             .
    └── validation
        ├── metadata.jsonl
        ├── {image_path0}
        ├── {image_path1}
                  .
                  .
    
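    A short sketch for creating that layout with pathlib (dataset_name and the split names mirror the tree above; adjust to your own paths):

    from pathlib import Path

    root = Path("dataset_name")
    for split in ("train", "validation", "test"):
        # Each split folder holds its images plus one metadata.jsonl
        # describing them (format shown next).
        (root / split).mkdir(parents=True, exist_ok=True)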

    For Document Information Extraction, each line in metadata.jsonl should look like this:

    {"file_name": {image_path0}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}
    {"file_name": {image_path1}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}
    
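    Note that ground_truth is a JSON string, not a nested object, so the parse has to be serialized separately. A minimal sketch for writing the train split's metadata.jsonl (the annotations list, file names, and field values are hypothetical placeholders for your own labels):

    import json
    from pathlib import Path

    # Hypothetical annotations: one entry per image in the split.
    annotations = [
        {"file_name": "receipt_0001.png", "gt_parse": {"total": "9.99", "date": "2023-01-15"}},
        {"file_name": "receipt_0002.png", "gt_parse": {"total": "24.50", "date": "2023-02-03"}},
    ]

    with open(Path("dataset_name/train") / "metadata.jsonl", "w") as f:
        for ann in annotations:
            line = {
                "file_name": ann["file_name"],
                # Serialize the parse on its own so ground_truth ends up
                # as a string, matching the format above.
                "ground_truth": json.dumps({"gt_parse": ann["gt_parse"]}),
            }
            f.write(json.dumps(line) + "\n")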

    The model ignores any other metadata and uses only the gt_parse (or gt_parses) field as the target JSON it learns to predict.
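
    To sanity-check a split, you can load it with the Hugging Face datasets imagefolder builder, which picks up metadata.jsonl automatically, and decode ground_truth back into a dict (a sketch, assuming the layout above):

    import json

    from datasets import load_dataset

    # imagefolder pairs each image with its row in metadata.jsonl.
    dataset = load_dataset("imagefolder", data_dir="dataset_name", split="train")

    sample = dataset[0]
    gt = json.loads(sample["ground_truth"])  # string -> dict
    print(gt["gt_parse"])                    # the JSON Donut should learn to emit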

    For other tasks like Document Visual Question Answering, refer to the Donut GitHub repository linked above.