google-cloud-platform ocr tesseract google-cloud-vision cloud-document-ai

How can I preserve the spatial layout of text when using the OCR tool from Google's Cloud Vision or Document AI?

I am using the OCR tool provided by Google's Document AI to extract text from images such as the one given below:

My goal is to create a dataframe in which each metric shown in the image (Rate, RR, PR, and so on) is a column filled with the value that follows it. This can easily be done with a regular expression that captures the numeric value appearing after a keyword such as "Rate". The problem, however, is that when I use the Document AI OCR to extract the text, it prints it as follows:

What can I do to make the Document AI OCR output the text exactly as it is laid out in the image?

Any help would be greatly appreciated :)

Solution

For this specific example, it would make sense to try the Form Parser processor as suggested by kiran-mathew.

This processor can detect key-value pairs and tables, which can be extracted from the Document AI response with the code samples shown in Handle the processing response > Forms and Tables.
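As a rough sketch of what reading the key-value pairs looks like, the snippet below works on the Form Parser response after it has been converted to a plain dict in the REST JSON shape (camelCase keys such as `formFields`, `fieldName`, `fieldValue`, `textAnchor`). That conversion step, and the helper names `anchor_text` and `form_fields_to_record`, are my own assumptions for illustration; with the Python client library the same fields should be available as snake_case attributes instead.

```python
def anchor_text(layout, full_text):
    """Return the slice(s) of the document text that a layout element points to."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    # REST JSON encodes the int64 offsets as strings and omits a zero startIndex.
    return "".join(
        full_text[int(seg.get("startIndex", 0)):int(seg.get("endIndex", 0))]
        for seg in segments
    )


def form_fields_to_record(document):
    """Collect every detected form field into a {name: value} dict."""
    full_text = document.get("text", "")
    record = {}
    for page in document.get("pages", []):
        for field in page.get("formFields", []):
            name = anchor_text(field.get("fieldName", {}), full_text)
            value = anchor_text(field.get("fieldValue", {}), full_text)
            # Drop surrounding whitespace and the trailing colon of labels like "Rate:".
            record[name.strip().rstrip(":").strip()] = value.strip()
    return record
```

The resulting dict can then be passed to pandas as a single row, e.g. `pandas.DataFrame([record])`, which gives one column per metric.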

The Document AI Toolbox Python SDK also has built-in functions for converting extracted tables to pandas DataFrames.
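A minimal sketch of the Toolbox route, assuming the `google-cloud-documentai-toolbox` package is installed and the Form Parser output has been saved locally as a Document JSON file. The function name `tables_to_dataframes` and the file path are placeholders of mine, not part of the SDK.

```python
def tables_to_dataframes(document_path):
    """Load a saved Document JSON file and return one DataFrame per detected table."""
    # Imported inside the function so the module still loads where the
    # toolbox package is not installed.
    from google.cloud.documentai_toolbox import document

    wrapped = document.Document.from_document_path(document_path=document_path)
    return [
        table.to_dataframe()
        for page in wrapped.pages
        for table in page.tables
    ]


# Hypothetical usage; "form_parser_output.json" is a placeholder path.
# dataframes = tables_to_dataframes("form_parser_output.json")
```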

In a more general case, you can get the layout bounding box detected by Document AI in the boundingPoly field for paragraphs, lines, tokens, etc. This contains the coordinates for each element on the page. Handle the processing response > Text, layout, and quality scores describes the structure.
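To illustrate the general case, here is a minimal sketch that rebuilds visual rows from token bounding boxes: tokens whose vertical centres are close together are treated as one line and then ordered left to right. It assumes the response is a plain dict in the REST JSON shape, that each token carries `layout.boundingPoly.normalizedVertices`, and that a tolerance of 0.01 (1% of the page height) is enough to separate rows. The helper names and the tolerance are my own choices, so treat this as a starting point rather than the official approach.

```python
def anchor_text(layout, full_text):
    """Return the slice(s) of the document text that a layout element points to."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    # REST JSON encodes the int64 offsets as strings and omits a zero startIndex.
    return "".join(
        full_text[int(seg.get("startIndex", 0)):int(seg.get("endIndex", 0))]
        for seg in segments
    )


def tokens_to_rows(document, row_tolerance=0.01):
    """Return, per page, a list of text rows rebuilt from token positions."""
    full_text = document.get("text", "")
    pages = []
    for page in document.get("pages", []):
        items = []
        for token in page.get("tokens", []):
            layout = token.get("layout", {})
            vertices = layout.get("boundingPoly", {}).get("normalizedVertices", [])
            text = anchor_text(layout, full_text).strip()
            if not vertices or not text:
                continue
            # Zero-valued coordinates may be omitted in JSON, hence the defaults.
            ys = [v.get("y", 0.0) for v in vertices]
            xs = [v.get("x", 0.0) for v in vertices]
            items.append((sum(ys) / len(ys), min(xs), text))
        items.sort()  # top to bottom by vertical centre

        rows, current, row_y = [], [], None
        for y, x, text in items:
            # Start a new row once a token sits too far below the row's first token.
            if current and abs(y - row_y) > row_tolerance:
                rows.append(current)
                current = []
            if not current:
                row_y = y
            current.append((x, text))
        if current:
            rows.append(current)

        # Within each row, order tokens left to right and join them with spaces.
        pages.append([" ".join(t for _, t in sorted(row)) for row in rows])
    return pages
```

Once each row reads as something like "Rate 72", the regular expression approach from the question can be applied line by line. The tolerance will likely need tuning for skewed scans or very small fonts.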