I was using the Document OCR API to extract text from a PDF file, but part of the output is not accurate. I found that the problem may be caused by the presence of some Chinese characters.
The following is a made-up example in which I cropped part of the region where the extracted text is wrong and added some Chinese characters to reproduce the problem.
When I use the website version, I cannot get the Chinese characters, but the remaining characters are correct.
When I use Python to extract the text, I get the Chinese characters correctly, but some of the remaining characters are wrong.
The actual string that I got.
Are the versions of Document AI behind the website and the API different? How can I get all the characters correctly?
Update:
When I print the detected_languages after printing the text, I get the following output. (I don't know why, but with lines = page.lines the detected_languages for both lines is an empty list; I had to switch to page.blocks or page.paragraphs first. A minimal snippet comparing the three levels follows the code below.)
Code:
from google.cloud import documentai_v1beta3 as documentai

project_id = 'secret-medium-xxxxxx'
location = 'us'  # Format is 'us' or 'eu'
processor_id = 'abcdefg123456'  # Create processor in Cloud Console

opts = {}
if location == "eu":
    opts = {"api_endpoint": "eu-documentai.googleapis.com"}

client = documentai.DocumentProcessorServiceClient(client_options=opts)


def get_text(doc_element: dict, document: dict):
    """
    Document AI identifies form fields by their offsets
    in document text. This function converts offsets
    to text snippets.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        start_index = (
            int(segment.start_index)
            if segment in doc_element.text_anchor.text_segments
            else 0
        )
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response


def get_lines_of_text(file_path: str, location: str = location, processor_id: str = processor_id, project_id: str = project_id):
    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    # opts = {}
    # if location == "eu":
    #     opts = {"api_endpoint": "eu-documentai.googleapis.com"}

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    document = {"content": image_content, "mime_type": "application/pdf"}

    # Configure the process request
    request = {"name": name, "raw_document": document}
    result = client.process_document(request=request)
    document = result.document
    document_pages = document.pages
    response_text = []

    # For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document

    # Read the text recognition output from the processor
    print("The document contains the following paragraphs:")
    for page in document_pages:
        lines = page.blocks
        for line in lines:
            block_text = get_text(line.layout, document)
            confidence = line.layout.confidence
            response_text.append((block_text[:-1] if block_text[-1:] == '\n' else block_text, confidence))
            print(f"Text: {block_text}")
            print("Detected Language", line.detected_languages)
    return response_text


if __name__ == '__main__':
    print(get_lines_of_text('/pdf path'))
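For reference, a minimal sketch to compare detected_languages at the three levels mentioned in the update above; it only reuses the get_text helper and the document returned by client.process_document:

# Minimal sketch: print detected_languages at line, paragraph and block level
# for the document returned by client.process_document above.
def print_detected_languages(document):
    for page_number, page in enumerate(document.pages, start=1):
        print(f"Page {page_number}")
        for level_name, elements in (
            ("line", page.lines),
            ("paragraph", page.paragraphs),
            ("block", page.blocks),
        ):
            for element in elements:
                languages = [
                    f"{lang.language_code} ({lang.confidence:.2f})"
                    for lang in element.detected_languages
                ]
                print(f"  {level_name}: {get_text(element.layout, document)!r} -> {languages}")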
It seems the language code is wrong; will this affect the result?
Posting this Community Wiki for better visibility.
One of the features of Document AI is OCR (Optical Character Recognition), which recognizes text in various file types.
In this scenario, the OP received different outputs when using the Try it function and the Python client library.
Why are there discrepancies between Try it and the Python library?
It's hard to say, as both methods use the same API (documentai_v1beta3). It might be related to modifications applied to the PDF when it is uploaded to the Try it demo, different endpoints, language/alphabet recognition, or some other randomness.
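If language detection really is part of the problem, one thing that might be worth trying (an assumption on my side, not something verified against the OP's processor: it needs the v1 client and a processor version that accepts process options) is passing explicit language hints to the OCR engine:

# Sketch only: newer releases of google-cloud-documentai (v1 API) expose
# ProcessOptions.ocr_config.hints.language_hints; whether the OP's processor
# version honours them is an assumption, not something tested here.
from google.cloud import documentai_v1 as documentai

def process_with_language_hints(processor_name: str, pdf_bytes: bytes):
    client = documentai.DocumentProcessorServiceClient()
    request = documentai.ProcessRequest(
        name=processor_name,  # projects/{project_id}/locations/{location}/processors/{processor_id}
        raw_document=documentai.RawDocument(
            content=pdf_bytes, mime_type="application/pdf"
        ),
        process_options=documentai.ProcessOptions(
            ocr_config=documentai.OcrConfig(
                hints=documentai.OcrConfig.Hints(
                    # Hint both scripts present in the file
                    language_hints=["zh-Hans", "en"]
                )
            )
        ),
    )
    return client.process_document(request=request).document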
When you use the Python client you also get a confidence score for the text identification. Below are examples from my tests:
However, the OP's confidence is only about 0.73, so wrong results are possible, and in this situation the issue is visible. I don't think it can be improved much in code; a higher-quality PDF might help (in the OP's example there are some dots that might affect recognition).
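If it helps, here is a small sketch for spotting the low-confidence regions programmatically, built on the get_lines_of_text function from the question (the 0.8 threshold is just an example value):

# Flag blocks whose OCR confidence falls below an example threshold,
# using the (text, confidence) tuples returned by get_lines_of_text above.
LOW_CONFIDENCE_THRESHOLD = 0.8  # example value, tune per document

def flag_low_confidence_blocks(file_path: str, threshold: float = LOW_CONFIDENCE_THRESHOLD):
    suspicious = []
    for text, confidence in get_lines_of_text(file_path):
        if confidence < threshold:
            suspicious.append((text, confidence))
            print(f"Low confidence ({confidence:.2f}): {text}")
    return suspicious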