Empty pages array in Google Document AI API OCR response

I'm currently using the Google Document AI API to extract text from PDFs using OCR. However, I've noticed that the pages array in the OCR response is always empty, even though the OCR operation completes successfully and I'm able to retrieve text from the document.

Here's a simplified version of the code I'm using:

from google.api_core.exceptions import InternalServerError, RetryError
from google.cloud import documentai_v1beta3 as documentai

@classmethod
def extract_text(cls, book_link: str):
    """Extract text from book using OCR"""

    # Upload the book to GCS
    filename = cls._upload_file_to_gcs(book_link=book_link)

    # Create the Batch Process Request
    gcs_input_uri = f"gs://{BUCKET}/input/{filename}"
    operation = cls._create_batch_process_request(gcs_input_uri=gcs_input_uri)

    # Wait for the operation to finish
    try:
        operation.result(timeout=300)
    # Catch exception when operation doesn't finish before timeout
    except (RetryError, InternalServerError) as e:
        # Pass the message itself; {e.message} would build a one-element set
        raise exceptions.APIException(
            detail=e.message
        )

    metadata = documentai.BatchProcessMetadata(operation.metadata)

    if metadata.state != documentai.BatchProcessMetadata.State.SUCCEEDED:
        # Pass the message itself; {metadata.state_message} would build a set
        raise exceptions.APIException(
            detail=metadata.state_message
        )

    output_documents = cls._get_output_documents(metadata=metadata)

    # Delete the input file from GCS
    cls.gcs_bookmapping_bucket.delete_blob(blob_name=f"input/{filename}")

    # Extract text from the output documents
    book_text = []
    for document in output_documents:
        for page in document.pages:  # document.pages is always empty here
            book_text.append(
                cls._layout_to_text(layout=page.layout, text=document.text)
            )


    return book_text

The document.text attribute contains the text of the entire document, but the pages array is always empty. This is preventing me from extracting text on a per-page basis, which is something I need for my application.

I've double-checked the input PDF files to ensure that they have multiple pages, so I'm confident that the issue is not with the input data.

I'm using documentai_v1beta3. I've also tried documentai_v1, but the result was the same.

Has anyone else experienced this issue with the Google Document AI API? Any suggestions for how I can retrieve text on a per-page basis?

Thanks in advance for your help.

Solution

Can you provide more information?

Which processor type are you using and which processor version?
Can you link to the full Document JSON output from batch processing and the original input document?
Does this occur with every document, or just a specific one?
Can you also provide the rest of your code?
Are you providing a FieldMask with the input?

My theory is that one of two things is happening.

Either you are using a processor that doesn't populate the pages array. You can find sample output files in the documentation to compare against.

Or you are providing a FieldMask in the request, which limits the fields that are present in the output. The "Send a processing request" page in the documentation shows how to use this field.
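Before digging into either theory, it can help to look at what the service actually wrote to the output bucket, independent of any client-side parsing. Below is a minimal sketch that summarizes one Document JSON file. It assumes the file has already been downloaded into a string and that the JSON uses the usual camelCase keys ("text", "pages"); the function name and the sample input are made up for illustration.

```python
import json


def summarize_document_json(raw: str) -> dict:
    """Report which top-level fields a Document JSON contains and how many pages it has."""
    doc = json.loads(raw)
    return {
        "fields": sorted(doc.keys()),
        "page_count": len(doc.get("pages", [])),
        "text_length": len(doc.get("text", "")),
    }


# Hand-written sample that mimics an output containing only the text field.
sample = json.dumps({"text": "Hello world\n"})
summary = summarize_document_json(sample)
print(summary)
```

If "pages" is missing from the raw file as well, the cause is on the request or processor side. If it is present there but empty in your code, the helper that loads the output documents is the more likely suspect.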

On a related note, you can simplify handling the batch process response by using the Document AI Toolbox SDK.
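Once the pages array is populated, per-page text can be recovered by slicing document.text with each page's layout text anchor. Here is a rough sketch that works on the raw Document JSON as a plain dict. It assumes the camelCase JSON keys ("layout", "textAnchor", "textSegments", "startIndex", "endIndex"), that indices may arrive as strings, and that "startIndex" is omitted when it is 0. The helper names and the sample document are hypothetical.

```python
from typing import List


def text_from_anchor(text_anchor: dict, full_text: str) -> str:
    """Join the slices of full_text referenced by a text anchor's segments."""
    parts = []
    for segment in text_anchor.get("textSegments", []):
        # Indices may be encoded as strings; a missing startIndex means 0.
        start = int(segment.get("startIndex", 0))
        end = int(segment.get("endIndex", 0))
        parts.append(full_text[start:end])
    return "".join(parts)


def text_per_page(doc: dict) -> List[str]:
    """Return one string per page, in page order."""
    full_text = doc.get("text", "")
    return [
        text_from_anchor(page.get("layout", {}).get("textAnchor", {}), full_text)
        for page in doc.get("pages", [])
    ]


# Hand-written two-page sample, not real processor output.
sample_doc = {
    "text": "Page one.\nPage two.\n",
    "pages": [
        {"layout": {"textAnchor": {"textSegments": [{"endIndex": "10"}]}}},
        {"layout": {"textAnchor": {"textSegments": [
            {"startIndex": "10", "endIndex": "20"}
        ]}}},
    ],
}
pages = text_per_page(sample_doc)
print(pages)
```

With an empty or missing pages array this returns an empty list, which matches the symptom in the question, so it only becomes useful after the cause above is resolved.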