Google Cloud Vision API: how to read text and structure it

I'm using the Google Cloud Vision API from Python to scan a document and read the text from it. The document is an invoice containing customer details and tables. The conversion from document to text works perfectly. However, the text comes back unsorted, and I can't find a way to sort it, which I need to do in order to extract a few values. The values I want are sometimes located at different positions in the output, which makes them difficult to extract.

https://cloud.google.com/vision/docs/fulltext-annotations

Here is my Python code:

import io
import os
from google.cloud import vision
# Note: the `types` module is available in google-cloud-vision releases before 2.0.
from google.cloud.vision import types
import glob


def scan_img(image_file):
    # Use a separate name for the file handle so it does not shadow the path argument.
    with io.open(image_file, 'rb') as f:
        content = f.read()

    image = types.Image(content=content)

    response = client.document_text_detection(image=image)
    document = response.full_text_annotation
    img_out_array = document.text.split("\n")
    invoice_no_raw = ""
    invoice_date_raw = ""
    net_total_idx = ""
    customer_name_index = ""

    for index, line in enumerate(img_out_array):
        if "Invoice No" in line:
            invoice_no_raw = line
        if "Customer Name" in line:
            index += 6
            customer_name_index = index
        if "Date :" in line:
            invoice_date_raw = line
        if "Our Bank details" in line:
            index -= 1
            net_total_idx = index

    net_total_sales_raw = img_out_array[net_total_idx]
    customer_name_raw = img_out_array[customer_name_index]
    print("Raw data:: ", invoice_no_raw, invoice_date_raw, customer_name_raw, img_out_array[net_total_idx])

    invoice_no = invoice_no_raw.split(":")[1]
    invoice_date = invoice_date_raw.split(":")[1]
    customer_name = customer_name_raw.replace("..", "")
    net_total_sales = net_total_sales_raw.split(" ")[-1]

    return [invoice_no, invoice_date, customer_name, net_total_sales]


if __name__ == '__main__':
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/imgtotext.json"
    client = vision.ImageAnnotatorClient()
    images = glob.glob("/path/Documents/invoices/*.jpg")
    for image in images:
        print("scanning the image:::::" + image)
        invoice_no, invoice_date, customer_name, net_total_sales = scan_img(image)
        print("Formatted data:: ", invoice_no, invoice_date, customer_name, net_total_sales)

Document 1 output:

Customer Name
Address
**x customer**
area name
streetname
Customer LPO

Document 2 output:

Customer LPO
**y customer**
area name
streetname
LPO Date
Payment Terms
Customer Name
Address
Delivery Location

Please advise. I want to read the customer name (x customer and y customer above), but its position changes from document to document, and I have several documents. How can I structure the output so that I can read this data reliably?

There are several other fields that I am able to read successfully.

Thanks in advance.

Solution

The Cloud Vision API has no request property that controls the format or ordering of the text it returns. The available workaround, I think, is to use the BoundingPoly and Vertex response properties, which give the coordinates of each word found in the image, and to process those vertices in your own code to decide which pieces of text belong to the same rows and columns. The Vision API documentation includes response examples that show these properties.
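As a rough sketch of that idea, the code below groups words into visual rows using only their vertices. It assumes the response's full_text_annotation follows the documented pages / blocks / paragraphs / words / symbols hierarchy, with each word carrying bounding_box.vertices. The helper names (words_from_annotation, group_into_rows), the y_tolerance parameter, and the sample coordinates are made up for illustration and are not part of the API.

```python
from statistics import mean


def words_from_annotation(document):
    """Flatten a full_text_annotation into (text, [(x, y), ...]) tuples."""
    words = []
    for page in document.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    text = "".join(symbol.text for symbol in word.symbols)
                    vertices = [(v.x, v.y) for v in word.bounding_box.vertices]
                    words.append((text, vertices))
    return words


def group_into_rows(words, y_tolerance=10):
    """Group words whose vertical centres are close, then order each row left to right.

    y_tolerance is in pixels and is an assumption; tune it to the image resolution.
    """
    items = []
    for text, vertices in words:
        center_y = mean(y for _, y in vertices)
        left_x = min(x for x, _ in vertices)
        items.append((center_y, left_x, text))
    items.sort()  # top to bottom, then left to right

    rows = []
    for center_y, left_x, text in items:
        # Compare against the first word of the current row.
        if rows and abs(center_y - rows[-1]["y"]) <= y_tolerance:
            rows[-1]["words"].append((left_x, text))
        else:
            rows.append({"y": center_y, "words": [(left_x, text)]})
    return [" ".join(text for _, text in sorted(row["words"])) for row in rows]


if __name__ == "__main__":
    # Hypothetical coordinates: a label on the left and its value on the same line.
    sample_words = [
        ("customer", [(260, 101), (330, 101), (330, 119), (260, 119)]),
        ("Address", [(10, 130), (70, 130), (70, 148), (10, 148)]),
        ("Name", [(90, 100), (130, 100), (130, 118), (90, 118)]),
        ("x", [(240, 101), (250, 101), (250, 119), (240, 119)]),
        ("Customer", [(10, 100), (80, 100), (80, 118), (10, 118)]),
    ]
    for row in group_into_rows(sample_words):
        print(row)
```

With real data you would call group_into_rows(words_from_annotation(response.full_text_annotation)) and then look for the row that contains "Customer Name", rather than relying on a fixed line offset. Whether the value sits on the same row as its label or on the row below depends on the invoice layout, so treat this as a starting point only.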

If this does not cover your needs, you can use the Send Feedback button, located at the lower left and upper right corners of the service's public documentation pages, or use the Issue Tracker tool to file a Vision API feature request and let Google know about the functionality you want.