I’m currently redesigning documents and forms to make extraction with Amazon Textract easier.
Do you have any experience or best practices to share?
Regards
Here are some recommended best practices from the Amazon Textract Developer Guide on providing an optimal input document:
The following is a list of a few ways that you can optimize your input documents for better results.
- Ensure that your document text is in a language that Amazon Textract supports. Currently, Amazon Textract supports English, Spanish, German, Italian, French, and Portuguese.
- Provide a high quality image, ideally at least 150 DPI.
- If your document is already in one of the file formats that Amazon Textract supports (PDF, TIFF, JPEG, and PNG), don't convert or downsample the document before uploading it to Amazon Textract.
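If you want to enforce the points above before uploading, here is a minimal sketch of a pre-flight check. It is my own illustration, not something from the Developer Guide: the `preflight` function, the extension list, and the idea of passing the DPI in by hand (read from your scanner settings or an image library) are all assumptions.

```python
from pathlib import Path

# Extensions for the formats named in the guide (PDF, TIFF, JPEG, PNG).
# Checking the extension only is an assumption; it does not inspect file contents.
SUPPORTED_EXTENSIONS = {".pdf", ".tif", ".tiff", ".jpg", ".jpeg", ".png"}
MIN_DPI = 150  # the guide's "ideally at least 150 DPI"


def preflight(path, dpi=None):
    """Return a list of warnings for one input document (hypothetical helper).

    dpi is supplied by the caller; pass None when it is unknown.
    """
    warnings = []
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        warnings.append(f"unsupported format: {suffix or '(none)'}")
    if dpi is not None and dpi < MIN_DPI:
        warnings.append(f"resolution {dpi} DPI is below {MIN_DPI} DPI")
    return warnings


if __name__ == "__main__":
    for name, dpi in [("scan.PNG", 300), ("form.bmp", 96)]:
        print(name, preflight(name, dpi))
```

An empty list means the file passed both checks; anything else is a hint to rescan or re-export before sending it to Textract.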
For the best results when extracting text from tables in documents, ensure that:
- Tables in your document are visually separated from surrounding elements on the page. For example, the table isn't overlaid onto an image or complex pattern.
- Text within the table is upright. For example, the text isn't rotated relative to other text on the page.
When extracting text from tables, you might see inconsistent results with:
- Merged table cells that span multiple columns.
- Tables with cells, rows, or columns that are different from other parts of the same table.
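While redesigning forms, it can help to check whether a test document still produces spanning cells. Below is a rough sketch that scans the `Blocks` list of a Textract table-analysis response for cells whose `RowSpan` or `ColumnSpan` is greater than 1. The helper name `find_spanning_cells` and the sample blocks are made up for illustration, and I'm assuming spans may show up on either `CELL` or `MERGED_CELL` blocks, so check this against a real response.

```python
def find_spanning_cells(blocks):
    """Return (row, column, row_span, column_span) for every cell spanning
    more than one row or column. Hypothetical helper, not part of any SDK."""
    found = []
    for block in blocks:
        # Assumption: spans may be reported on CELL or MERGED_CELL blocks.
        if block.get("BlockType") not in ("CELL", "MERGED_CELL"):
            continue
        row_span = block.get("RowSpan", 1)
        col_span = block.get("ColumnSpan", 1)
        if row_span > 1 or col_span > 1:
            found.append(
                (block.get("RowIndex"), block.get("ColumnIndex"), row_span, col_span)
            )
    return found


# Hand-written sample, not real Textract output.
sample_blocks = [
    {"BlockType": "TABLE"},
    {"BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1, "RowSpan": 1, "ColumnSpan": 1},
    {"BlockType": "MERGED_CELL", "RowIndex": 2, "ColumnIndex": 1, "RowSpan": 1, "ColumnSpan": 3},
]

if __name__ == "__main__":
    print(find_spanning_cells(sample_blocks))
```

In practice you would pass in `response["Blocks"]` from a table analysis call. If the list comes back non-empty, that table layout is a candidate for simplification.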
I highly recommend taking a look at the Developer Guide.