google-cloud-platform ocr cloud-document-ai

Google Document AI does not output blocks in the right order


I'm trying to OCR this image using Google's Document AI, but it seems to output the text in completely the wrong order.

Here's the image:

Image to be OCRed

The output is as follows:

piece of clothing,
= do up
zip up something
with difficulty.
zip something up
you fasten it using a zip.
She zipped up the dress
Hezipped his jeans up.

It seems to first split the image in half, which it shouldn't, and then read the text in each half separately. But the image is already split, and is intended to be read line by line.

How can I tell Document AI to just read the image line by line?

This is the Python code I'm using:

    from typing import Optional

    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai


    def quickstart(
        project_id: str,
        location: str,
        processor_id: str,
        file_path: str,
        mime_type: str,
        processor_version_id: Optional[str] = None,
    ):
        # You must set the api_endpoint if you use a location other than 'us'.
        opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

        client = documentai.DocumentProcessorServiceClient(client_options=opts)

        if processor_version_id:
            # The full resource name of the processor version, e.g.:
            # projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}
            name = client.processor_version_path(
                project_id, location, processor_id, processor_version_id
            )
        else:
            # The full resource name of the processor, e.g.:
            # projects/{project_id}/locations/{location}/processors/{processor_id}
            name = client.processor_path(project_id, location, processor_id)

        # Read the file into memory
        with open(file_path, "rb") as image:
            image_content = image.read()

        # Load binary data into a Document AI RawDocument object
        raw_document = documentai.RawDocument(content=image_content, mime_type=mime_type)

        # Configure the process request
        request = documentai.ProcessRequest(name=name, raw_document=raw_document)

        result = client.process_document(request=request)

        # For a full list of Document object attributes, please reference this page:
        # https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.Document
        document = result.document

        # Write the text recognition output from the processor to a file.
        # The original snippet used a file handle `f` defined elsewhere;
        # "output.txt" is an assumed placeholder path.
        with open("output.txt", "a", encoding="utf-8") as f:
            f.write(file_path + "\n")
            f.write(document.text)

Solution

  • Document AI OCR may return paragraph/block text in a different order than you expect, because text can be laid out on a page in many different ways and the API has to guess the reading order.

    You can use the bounding box information in Document.pages[].paragraphs[].layout.boundingPoly (or the equivalent field on lines, for line-level output) to decide in which order to handle the text (e.g. top to bottom, left to right).

    You can refer to the "Handle the processing response" page of the Document AI documentation for more information on how this response is structured. You can also try the Document AI demo to see how the blocks and paragraphs are extracted.
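As a rough illustration of the bounding-box approach, here is a minimal sketch that rebuilds the text row by row. It assumes the Python client's snake_case field names (`page.lines`, `layout.text_anchor.text_segments`, `layout.bounding_poly.normalized_vertices`) and that normalized vertices are populated for every line. The helper names and the `row_tolerance` parameter are hypothetical and not part of the Document AI API.

```python
from statistics import median


def layout_text(layout, full_text):
    """Return the text a layout element points to via its text anchor."""
    return "".join(
        full_text[int(seg.start_index):int(seg.end_index)]
        for seg in layout.text_anchor.text_segments
    )


def layout_box(layout):
    """Return (left, top, bottom) of the normalized bounding polygon."""
    vertices = layout.bounding_poly.normalized_vertices
    xs = [v.x for v in vertices]
    ys = [v.y for v in vertices]
    return min(xs), min(ys), max(ys)


def text_in_reading_order(document, row_tolerance=0.5):
    """Rebuild the text top to bottom, then left to right within each row.

    row_tolerance is a hypothetical tuning knob: two detected lines count
    as the same visual row when their vertical centers differ by less than
    row_tolerance * (median line height on the page).
    """
    rows_out = []
    for page in document.pages:
        items = []  # (left, vertical center, height, text)
        for line in page.lines:
            left, top, bottom = layout_box(line.layout)
            text = layout_text(line.layout, document.text).strip()
            if text:
                items.append((left, (top + bottom) / 2, bottom - top, text))
        if not items:
            continue
        threshold = row_tolerance * median(item[2] for item in items)
        items.sort(key=lambda item: item[1])  # top to bottom
        rows = [[items[0]]]
        for item in items[1:]:
            # Same row if close to the vertical center of the row's first item.
            if abs(item[1] - rows[-1][0][1]) < threshold:
                rows[-1].append(item)
            else:
                rows.append([item])
        for row in rows:
            row.sort(key=lambda item: item[0])  # left to right
            rows_out.append(" ".join(item[3] for item in row))
    return "\n".join(rows_out)
```

With the code from the question, this could be called as `text_in_reading_order(result.document)` in place of reading `document.text` directly. The tolerance will likely need tuning for skewed scans or mixed font sizes, and swapping `page.lines` for `page.paragraphs` or `page.blocks` changes the granularity.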