Is there a way to convert the Document AI OCR response into PDF format?

I am passing scanned PDFs into the Google Cloud Document AI OCR. The JSON response (or the Document object returned when using the Python API) contains the content of the PDF in a structured format, as described here. I would like to be able to output a PDF file as well (or XML if that's easier). Does such functionality exist? Any hints on possible implementations are appreciated.

Note: the PDFs are already OCRed by another tool prior to my tasks, but the quality is not as good as the Document AI OCR.

Thank you

Solution

Sharing in case anyone else is looking for this. I found the repository gcv2hocr, which has a script to convert the Google Cloud Vision response (for image input) to hOCR format. The hOCR output can then be converted to other formats, including PDF, using hocr-tools.

I suppose it would not be very difficult to adapt this code to work with the Document AI response.
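
As a rough starting point for such an adaptation, here is a minimal sketch that turns a Document AI response, loaded as a plain dict from its JSON form, into line-level hOCR. It is not taken from gcv2hocr. It assumes the JSON uses the `text`, `pages[].dimension`, `pages[].lines[].layout.textAnchor.textSegments` and `layout.boundingPoly.normalizedVertices` fields, and that segment indices arrive as strings with a zero `startIndex` omitted. The helper names (`layout_text`, `layout_bbox`, `document_to_hocr`) are my own, and I have not checked the output against hocr-tools, so treat it as an outline rather than a finished converter.

```python
from html import escape


def layout_text(layout, full_text):
    """Join the slices of the document text that a layout's textAnchor points at."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    # Assumption: in the JSON form the indices are strings, and startIndex
    # is left out when it is 0.
    return "".join(
        full_text[int(seg.get("startIndex", 0)):int(seg.get("endIndex", 0))]
        for seg in segments
    )


def layout_bbox(layout, width, height):
    """Scale normalized vertices (0..1) to a pixel bbox (x0, y0, x1, y1)."""
    verts = layout.get("boundingPoly", {}).get("normalizedVertices", [])
    xs = [v.get("x", 0.0) * width for v in verts] or [0.0]
    ys = [v.get("y", 0.0) * height for v in verts] or [0.0]
    return round(min(xs)), round(min(ys)), round(max(xs)), round(max(ys))


def document_to_hocr(doc):
    """Build a line-level hOCR string from a Document AI response dict."""
    full_text = doc.get("text", "")
    out = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'/>",
        "<meta name='ocr-capabilities' content='ocr_page ocr_line'/>",
        "</head><body>",
    ]
    for page_no, page in enumerate(doc.get("pages", []), start=1):
        dim = page.get("dimension", {})
        width, height = dim.get("width", 0), dim.get("height", 0)
        out.append(
            f"<div class='ocr_page' id='page_{page_no}' "
            f"title='bbox 0 0 {round(width)} {round(height)}'>"
        )
        for line_no, line in enumerate(page.get("lines", []), start=1):
            layout = line.get("layout", {})
            x0, y0, x1, y1 = layout_bbox(layout, width, height)
            text = escape(layout_text(layout, full_text).strip())
            out.append(
                f"<span class='ocr_line' id='line_{page_no}_{line_no}' "
                f"title='bbox {x0} {y0} {x1} {y1}'>{text}</span>"
            )
        out.append("</div>")
    out.append("</body></html>")
    return "\n".join(out)
```

To use it, load the saved response with `json.load`, pass the resulting dict to `document_to_hocr`, and write the returned string to an `.hocr` file. Word-level output could probably be added the same way by iterating over `pages[].tokens` instead of `pages[].lines`.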
