How can I ensure that the GCP Document AI model outputs JSON with the same name as the input file?


I am using Python to batch process PDFs through GCP Document AI ("DocAI"). The PDFs have long filenames such as 71.169892_01-2022.10.15-21275188-1111.pdf. Often the only difference between two filenames is the last four digits before .pdf (such as 71.169892_01-2022.10.15-21275188-1111.pdf and 71.169892_01-2022.10.15-21275188-2547.pdf).

When such a PDF is processed through DocAI, DocAI outputs one or more JSON files with a shortened filename such as 71.169892_01-2022.10-0.json, 71.169892_01-2022.10-1.json, and so on. How can I ensure that DocAI does not cut off the filename? Is there an attribute I can add to the batch processing request to ensure that the output preserves the full filename?

This is important because when I process 2 PDFs with nearly identical filenames (e.g. 71.169892_01-2022.10.15-21275188-1111.pdf and 71.169892_01-2022.10.15-21275188-2547.pdf), the resulting JSONs end up with the same filename: 71.169892_01-2022.10-0.json. This is a problem when such JSONs are moved from the folders where DocAI automatically stores them into a single folder: the second JSON simply overwrites the first, which has the same name.

The current state is as follows:

Input PDF: 71.169892_01-2022.10.15-21275188-1111.pdf

Output JSON: 71.169892_01-2022.10-0.json

Expecting:

Input PDF: 71.169892_01-2022.10.15-21275188-1111.pdf

Output JSON: 71.169892_01-2022.10.15-21275188-1111.json


Solution

  • Currently, Document AI lets you specify only the output bucket and folder, not the output filename. Batch processing always appends -0 or another number to the JSON filenames, because larger documents can be split into multiple "shards".

    If possible, I would recommend sending files with nearly identical names in separate requests to avoid the overwriting issue, since each request writes its output to a different folder named for the operation ID.

    However, this is definitely an edge case that should be handled in the product, so I'll report this issue to the development team.

    Update: A fix has been made and it should be rolled out in the next couple of weeks. This should prevent the truncation of the filenames and the overwriting issue, but the output files will still have suffixes like -0.
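
As a rough illustration of the "separate requests" workaround, here is a minimal Python sketch that splits a list of filenames into groups so that no group contains two names that would collide after truncation. The function name `split_into_requests` and the 20-character prefix length are assumptions for illustration only: the length is inferred from the single example in the question and is not a documented Document AI limit. The sketch only does the grouping and makes no Document AI calls.

```python
from collections import defaultdict

# Assumption: in the question's example the output name keeps only the first
# 20 characters of the input name. This is inferred, not a documented limit.
ASSUMED_PREFIX_LEN = 20


def split_into_requests(filenames, prefix_len=ASSUMED_PREFIX_LEN):
    """Group filenames so that no group holds two names sharing a prefix."""
    # Bucket the names by the prefix that is assumed to survive truncation.
    by_prefix = defaultdict(list)
    for name in filenames:
        by_prefix[name[:prefix_len]].append(name)

    # The i-th file of every prefix bucket goes into the i-th request,
    # so files with the same prefix never share a request.
    requests = []
    for names in by_prefix.values():
        for i, name in enumerate(names):
            if i == len(requests):
                requests.append([])
            requests[i].append(name)
    return requests
```

Each returned group could then be submitted as its own batch request, so that colliding names land in different operation folders. When the results are later moved into one folder, the names would still need to be made unique, for example by including the operation ID.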