How to get data in an organised way from a PDF in Document AI [cloud-document-ai]


I created the following schema. [Image: Schema]

I am getting all the data in the main object, at the root level. As can be seen, the data belongs to two different people, so it should be separated per person.

[Image: Document]

Is it possible to get the data in the following structure in Document AI?

[
  { name: "XXX", id: "XXX", line_items: [{ shift_date: "xxx", shift_duration: "xxx", ... }] },
  { name: "XXX", id: "XXX", line_items: [{ shift_date: "xxx", shift_duration: "xxx", ... }] }
]


Solution

  • For clarity, it looks like you're using the Custom Document Extractor.

    Document AI won't separate the entities in the way you're asking. All of the extracted entities will be in the Entities field in the Document output.

    If you create nested entities, which is how shift_detail is set up, the child entities will be in the properties field of the parent entity. However, only one layer of nesting is supported for now.

    You can do some post-processing after sending the document to the processor to separate the entities into the distinct sections of the document. For example, the Entity.id field will usually increase from the top to the bottom of the page, so it could be used together with the "ID" or "Full Name" field on the document to determine which section each entity belongs to. Alternatively, you could use the entities' bounding boxes to determine their position on the page.

    Another option is to split the page by timesheet and send each timesheet to the processor separately to avoid needing to separate the entities in post-processing.
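
The post-processing approach described above can be sketched roughly as follows. This is only a minimal illustration, not an official Document AI feature: it assumes the processor output has been loaded as a plain dict from the Document JSON (camelCase keys such as mentionText and pageAnchor), that the schema uses the entity types name, id and shift_detail (hypothetical labels based on the question), and that each person's name is the topmost entity of their section. It sorts entities by page and vertical position, starts a new record at each name, and reads nested values from the parent's properties list.

```python
from typing import Any

# Hypothetical entity type names based on the question's desired output;
# change them to match the labels in your own processor schema.
NAME_TYPE = "name"
ID_TYPE = "id"
LINE_ITEM_TYPE = "shift_detail"


def page_and_top(entity: dict[str, Any]) -> tuple[int, float]:
    """Return (page index, top y) so entities can be sorted in reading order."""
    refs = entity.get("pageAnchor", {}).get("pageRefs", [])
    if not refs:
        return (0, 0.0)
    ref = refs[0]
    vertices = ref.get("boundingPoly", {}).get("normalizedVertices", [])
    top = min((v.get("y", 0.0) for v in vertices), default=0.0)
    # Page numbers may arrive as strings, or be omitted for the first page.
    return (int(ref.get("page", 0)), top)


def group_by_person(document: dict[str, Any]) -> list[dict[str, Any]]:
    """Start a new record at each name entity and attach what follows to it."""
    people: list[dict[str, Any]] = []
    for entity in sorted(document.get("entities", []), key=page_and_top):
        etype = entity.get("type")
        if etype == NAME_TYPE:
            people.append(
                {"name": entity.get("mentionText", ""), "id": None, "line_items": []}
            )
        elif not people:
            continue  # entity appears above the first name; nothing to attach to
        elif etype == ID_TYPE:
            people[-1]["id"] = entity.get("mentionText", "")
        elif etype == LINE_ITEM_TYPE:
            # Child entities of a nested entity live in its "properties" list.
            item = {
                prop.get("type"): prop.get("mentionText", "")
                for prop in entity.get("properties", [])
            }
            people[-1]["line_items"].append(item)
    return people
```

Calling group_by_person on the loaded Document dict returns a list shaped like the structure in the question. If the name is not reliably the first entity in each timesheet, the grouping rule would need to change, for example by comparing each entity's position against known section boundaries.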