Tags: google-cloud-platform, cloud-document-ai

Document AI batch process operation different payload returned


I'm working on a problem where I need to split documents using Document AI. For this, I'm following the official GitHub samples repo for Document AI batch processing.

The batch process function returns a long-running operation. The operation is then polled, and the results are read from the operation metadata.
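
For reference, the flow in the sample looks roughly like this (a sketch only; `client` and `request` are built as in the repo):

    from google.cloud import documentai
    
    operation = client.batch_process_documents(request)
    
    # Blocks until the long-running operation completes (or times out).
    operation.result(timeout=300)
    
    # Metadata from the operation returned here can be wrapped directly.
    metadata = documentai.BatchProcessMetadata(operation.metadata)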

When I try to retrieve the operation separately using the long-running operations API, I get the operation object back, but its metadata is in a different format, so I'm not able to process the document further.

To retrieve the operation later, I'm using the get_operation function from the same repo.

Thanks in advance!!

repository link: https://github.com/GoogleCloudPlatform/python-docs-samples/tree/239d42f8dcb564db35c0b9fc79d8c07f6f6fe489/documentai/snippets

The batch_process_sample works fine, but I need to get the same result even when I retrieve the operation object separately and process it in the same way.


Solution

  • The Operation data returned from get_operation() is in a slightly different format than what's returned directly from batch_process_documents(). This seems to be a quirk of how Google APIs handle long-running operations.
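
    Concretely, the difference looks like this (a sketch; `client` and `request` are assumed to be set up as in the samples):

    from google.cloud import documentai
    from google.longrunning.operations_pb2 import GetOperationRequest
    
    # Returned directly from batch_process_documents(), the metadata is
    # already typed and can be wrapped as-is:
    operation = client.batch_process_documents(request)
    operation.result(timeout=300)
    metadata = documentai.BatchProcessMetadata(operation.metadata)
    
    # Returned from get_operation(), the metadata is a serialized
    # google.protobuf.Any and has to be deserialized first:
    retrieved = client.get_operation(
        request=GetOperationRequest(name=operation.operation.name)
    )
    metadata = documentai.BatchProcessMetadata.deserialize(retrieved.metadata.value)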

    The code sample and documentation don't include information about this, but I figured out how to do it using the built-in methods. (I'm in the process of adding features to the Document AI Toolbox SDK that pull the Document output from the GCS URIs in BatchProcessMetadata, or from an Operation name, to make this easier.)

    Update: Code for Document AI Toolbox

    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai
    from google.cloud.documentai_toolbox import document
    
    project_id = "YOUR_PROJECT_ID"
    location = "YOUR_PROCESSOR_LOCATION"
    
    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
    )
    
    # `request` is the documentai.BatchProcessRequest built for your processor.
    operation = client.batch_process_documents(request)
    # Format: projects/{project_id}/locations/{location}/operations/15842030886767182557
    operation_name = operation.operation.name
    
    # Returns a list of wrapped documents (one per processed input file);
    # use them to get the extraction information you need.
    documents = document.Document.from_batch_process_operation(
        location=location, operation_name=operation_name
    )
    

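    The returned wrapped documents expose the extracted content directly, for example (a minimal sketch; attribute names follow the Document AI Toolbox wrappers):

    for doc in documents:
        # Full text of the processed document.
        print(doc.text)
        # Extracted entities, if the processor produces any.
        for entity in doc.entities:
            print(entity.type_, entity.mention_text)
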
    Main APIs

    import time
    
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai
    from google.longrunning.operations_pb2 import GetOperationRequest
    
    project_id = "YOUR_PROJECT_ID"
    location = "YOUR_PROCESSOR_LOCATION"
    operation_name = (
        f"projects/{project_id}/locations/{location}/operations/15842030886767182557"
    )
    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
    )
    
    while True:
        operation = client.get_operation(
            request=GetOperationRequest(name=operation_name)
        )
    
        if operation.done:
            break
    
        # Wait between polls so the API isn't queried in a tight loop.
        time.sleep(10)
    
    # The BatchProcessMetadata returned by get_operation() is serialized as a
    # protobuf Any, so it must be deserialized before its values can be read.
    metadata = documentai.BatchProcessMetadata.deserialize(operation.metadata.value)
    
    # Read the individual process statuses
    for process in metadata.individual_process_statuses:
        # Handle the response however you need
        print(process.output_gcs_destination)
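
    To get from output_gcs_destination to the processed Document itself, the output JSON can be downloaded and parsed the same way the batch processing sample does (a sketch; assumes the google-cloud-storage client library is installed):

    import re
    
    from google.cloud import storage
    
    storage_client = storage.Client()
    
    for process in metadata.individual_process_statuses:
        # output_gcs_destination looks like gs://bucket/prefix
        match = re.match(r"gs://(.*?)/(.*)", process.output_gcs_destination)
        if not match:
            continue
        output_bucket, output_prefix = match.groups()
    
        # Document AI writes one or more Document JSON files per input file.
        for blob in storage_client.list_blobs(output_bucket, prefix=output_prefix):
            if blob.content_type != "application/json":
                continue
            doc = documentai.Document.from_json(
                blob.download_as_bytes(), ignore_unknown_fields=True
            )
            print(doc.text)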