I am passing scanned PDFs into the Google Cloud Document AI OCR. The JSON response (or the Document object returned when using the Python API) contains the content of the PDF in a structured format, as described here. I would like to be able to output a PDF file as well (or XML if that's easier). Is there such a functionality? Any hints on possible implementations are appreciated.
Note: the PDFs are already OCRed by another tool prior to my tasks, but the quality is not as good as the Document AI OCR.
Thank you
Sharing if anyone else is looking for this. I found this repository gcv2hocr which has a script to convert the Google Cloud Vision response (for image input) to hOCR format. The hOCR output can then be converted to other formats, including PDF using hocr-tools.
I suppose it would not be very difficult to adapt this code to work with the DocumentAI response.