I am using the OCR tool provided by Google's Document AI to extract text from images such as the one given below:
My goal is to create a dataframe in which each metric, such as Rate, RR, and PR, is a column filled with the value that follows it. This is easy to do with a regular expression that captures the numeric value after a keyword such as "Rate". The problem, however, is that when I use the Document AI OCR to extract the text, it prints it as follows:
How can I get the Document AI OCR to print the text exactly as it appears in the image?
Any help would be greatly appreciated :)
For this specific example, it would make sense to try the Form Parser processor as suggested by kiran-mathew.
This processor can detect key-value pairs and tables, which can be extracted from the Document AI response with the code samples shown in Handle the processing response > Forms and Tables.
The Document AI Toolbox Python SDK also has built-in functions for converting extracted tables to pandas DataFrames.
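As a rough illustration of reading the key-value pairs, here is a minimal sketch that works on the JSON form of a Form Parser response (a plain dict). The helper names `anchor_text` and `form_fields_to_row` and the sample document are hypothetical and hand-written for this example; only the field names (`formFields`, `fieldName`, `fieldValue`, `textAnchor`, `textSegments`) are taken from the Document format.

```python
def anchor_text(full_text, layout):
    """Concatenate the slices of the document text that a layout points to."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    # In the JSON form, indexes are strings and a startIndex of 0 is omitted.
    return "".join(
        full_text[int(seg.get("startIndex", 0)):int(seg["endIndex"])]
        for seg in segments
    ).strip()


def form_fields_to_row(document):
    """Return {field name: field value} for every detected form field."""
    row = {}
    for page in document.get("pages", []):
        for field in page.get("formFields", []):
            name = anchor_text(document["text"], field.get("fieldName", {}))
            value = anchor_text(document["text"], field.get("fieldValue", {}))
            row[name.rstrip(":")] = value
    return row


# Hypothetical, hand-written fragment; not real processor output.
sample_document = {
    "text": "Rate: 72\nPR: 160\n",
    "pages": [{
        "formFields": [
            {
                "fieldName": {"textAnchor": {"textSegments": [
                    {"endIndex": "5"}]}},
                "fieldValue": {"textAnchor": {"textSegments": [
                    {"startIndex": "6", "endIndex": "8"}]}},
            },
            {
                "fieldName": {"textAnchor": {"textSegments": [
                    {"startIndex": "9", "endIndex": "12"}]}},
                "fieldValue": {"textAnchor": {"textSegments": [
                    {"startIndex": "13", "endIndex": "16"}]}},
            },
        ]
    }],
}

row = form_fields_to_row(sample_document)
print(row)
```

Passing a list of such rows to `pandas.DataFrame` would give one column per metric, which is the shape asked for in the question.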
In a more general case, you can get the layout bounding box detected by Document AI from the `boundingPoly` field for `paragraphs`, `lines`, `tokens`, etc. This contains the coordinates for each element on the page. Handle the processing response > Text, layout, and quality scores describes the structure.
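One way to use those coordinates is to group tokens into visual rows and read each row left to right. The sketch below assumes you have already pulled each token's text and its `normalizedVertices` out of the response; the function names, the `y_tolerance` value, and the sample tokens are hypothetical and would need tuning for a real image.

```python
def y_center(vertices):
    """Vertical midpoint of a bounding polygon (zero coordinates may be omitted)."""
    return sum(v.get("y", 0.0) for v in vertices) / len(vertices)


def x_left(vertices):
    """Leftmost x coordinate of a bounding polygon."""
    return min(v.get("x", 0.0) for v in vertices)


def tokens_to_rows(tokens, y_tolerance=0.01):
    """Group (text, vertices) pairs into rows, each read left to right.

    Tokens whose vertical midpoints lie within y_tolerance of the first
    token in the current row are treated as being on the same line.
    """
    rows = []
    for text, vertices in sorted(tokens, key=lambda t: y_center(t[1])):
        yc = y_center(vertices)
        if rows and abs(yc - rows[-1]["y"]) <= y_tolerance:
            rows[-1]["items"].append((x_left(vertices), text))
        else:
            rows.append({"y": yc, "items": [(x_left(vertices), text)]})
    return [" ".join(text for _, text in sorted(row["items"])) for row in rows]


def box(x, y, w=0.05, h=0.02):
    """Build a rectangle in normalized page coordinates (illustration only)."""
    return [{"x": x, "y": y}, {"x": x + w, "y": y},
            {"x": x + w, "y": y + h}, {"x": x, "y": y + h}]


# Hypothetical tokens, deliberately out of reading order.
sample_tokens = [
    ("72", box(0.30, 0.10)),
    ("PR", box(0.10, 0.15)),
    ("Rate", box(0.10, 0.10)),
    ("160", box(0.30, 0.151)),
]

lines = tokens_to_rows(sample_tokens)
print(lines)
```

Once each label sits on the same line as its value, the keyword-plus-number regular expression described in the question can be applied to every reconstructed row.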