Response from Document AI stored in Google Cloud Storage


I am using a GCP workflow with an Eventarc trigger connected to Cloud Storage, so that a document is evaluated by Document AI as soon as the bucket receives it. Whenever I try to evaluate a document, I get an error stating that the memory limit was exceeded. My test document is about 120 KB, and from what I have read, a workflow can handle a response size of up to 2 MB. At first I thought the error came from logging the response (I only wanted to see what it looked like), so I switched to storing it in a separate storage bucket, but I keep getting the same error. Is the response from Document AI too large, so that I need to compress it before saving it to the bucket? Below is my current YAML code:

main:
  params: [event]
  steps:
    - start:
        call: sys.log
        args:
          text: ${event}
    - vars:
        assign:
          - file_name: ${event.data.name}
          - mime_type: ${event.data.contentType}
          - input_gcs_bucket: ${event.data.bucket}
    - batch_doc_process:
        call: googleapis.documentai.v1.projects.locations.processors.process
        args:
          name: ${"projects/" + sys.get_env("GOOGLE_CLOUD_PROJECT_ID") + "/locations/" + sys.get_env("LOCATION") + "/processors/" + sys.get_env("PROCESSOR_ID")}
          location: ${sys.get_env("LOCATION")}
          body:
            gcsDocument:
              gcsUri: ${"gs://" + input_gcs_bucket + "/" + file_name}
              mimeType: ${mime_type}
            skipHumanReview: true
        result: doc_process_resp
    - store_process_resp:
        call: googleapis.storage.v1.objects.insert
        args:
          bucket: ${sys.get_env("OUTPUT_GCS_BUCKET")}
          name: ${file_name}
          body: ${doc_process_resp}

Solution

  • I had to switch from a single-document process request to a batch process request. A batch request lets me specify the Cloud Storage bucket that Document AI writes its output to once processing is done, so the full response no longer has to pass through the workflow. The step changes from this:

    - batch_doc_process:
        call: googleapis.documentai.v1.projects.locations.processors.process
        args:
          name: ${"projects/" + sys.get_env("GOOGLE_CLOUD_PROJECT_ID") + "/locations/" + sys.get_env("LOCATION") + "/processors/" + sys.get_env("PROCESSOR_ID")}
          location: ${sys.get_env("LOCATION")}
          body:
            gcsDocument:
              gcsUri: ${"gs://" + input_gcs_bucket + "/" + file_name}
              mimeType: ${mime_type}
            skipHumanReview: true
        result: doc_process_resp

    to this one:

    - batch_doc_process:
        call: googleapis.documentai.v1.projects.locations.processors.batchProcess
        args:
          name: ${"projects/" + sys.get_env("GOOGLE_CLOUD_PROJECT_ID") + "/locations/" + sys.get_env("LOCATION") + "/processors/" + sys.get_env("PROCESSOR_ID")}
          location: ${sys.get_env("LOCATION")}
          body:
            inputDocuments:
              gcsDocuments:
                documents:
                  - gcsUri: ${"gs://" + input_gcs_bucket + "/" + file_name}
                    mimeType: ${mime_type}
            documentOutputConfig:
              gcsOutputConfig:
                gcsUri: ${sys.get_env("OUTPUT_GCS_BUCKET")}
            skipHumanReview: true
        result: doc_process_resp

    A small change, but one that actually allows it to work properly.
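
    For reference, here is a rough sketch of how the whole workflow might look once the batch step replaces the original one. It is assembled from the snippets in the question and answer, not a tested deployment. Two things in it are assumptions: the final `done` step is a hypothetical addition that returns only the file name, and `OUTPUT_GCS_BUCKET` is assumed to hold a full `gs://` URI, since the batch output setting appears to expect a Cloud Storage URI and not the bare bucket name that `objects.insert` takes. The `store_process_resp` step and the initial logging step are left out, because Document AI writes the output to the bucket itself.

```yaml
main:
  params: [event]
  steps:
    # Pull the object details out of the Eventarc Cloud Storage event.
    - vars:
        assign:
          - file_name: ${event.data.name}
          - mime_type: ${event.data.contentType}
          - input_gcs_bucket: ${event.data.bucket}
    # Batch request: Document AI writes its output to Cloud Storage,
    # so the parsed document is not held in a workflow variable.
    - batch_doc_process:
        call: googleapis.documentai.v1.projects.locations.processors.batchProcess
        args:
          name: ${"projects/" + sys.get_env("GOOGLE_CLOUD_PROJECT_ID") + "/locations/" + sys.get_env("LOCATION") + "/processors/" + sys.get_env("PROCESSOR_ID")}
          location: ${sys.get_env("LOCATION")}
          body:
            inputDocuments:
              gcsDocuments:
                documents:
                  - gcsUri: ${"gs://" + input_gcs_bucket + "/" + file_name}
                    mimeType: ${mime_type}
            documentOutputConfig:
              gcsOutputConfig:
                # Assumption: this variable already contains a gs:// URI.
                # If it holds only a bucket name, prepend "gs://" here.
                gcsUri: ${sys.get_env("OUTPUT_GCS_BUCKET")}
            skipHumanReview: true
        result: doc_process_resp
    # Hypothetical final step: return a small value, not the full response.
    - done:
        return: ${file_name}
```

    Returning only the file name keeps the workflow's own output small; the processed document can then be read from the output bucket by whatever consumes it next.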