Search code examples
cloud-document-ai

Is there a way to parse the Document AI OCR response into pdf format?


I am passing scanned PDFs into the Google Cloud Document AI OCR. The JSON response (or the Document object returned when using the Python API) contains the content of the PDF in a structured format, as described here. I would like to be able to output a PDF file as well (or XML if that's easier). Is there such a functionality? Any hints on possible implementations are appreciated.

Note: the PDFs are already OCRed by another tool prior to my tasks, but the quality is not as good as the Document AI OCR.

Thank you


Solution

  • Sharing if anyone else is looking for this. I found this repository gcv2hocr which has a script to convert the Google Cloud Vision response (for image input) to hOCR format. The hOCR output can then be converted to other formats, including PDF using hocr-tools.

    I suppose it would not be very difficult to adapt this code to work with the DocumentAI response.