amazon-web-services pdf amazon-s3 amazon-textract

Upload a PDF file and analyze it with Textract without uploading the file to an S3 bucket


I'm planning to create a program in Laravel where you can upload a PDF file and analyze it with Textract OCR. I want the user to upload the PDF file and have it analyzed by Textract without the PDF being uploaded to an S3 bucket. My question is: is that possible, or do I really need to upload the file to an S3 bucket before Textract can analyze it? I ask because in most of the tutorials I see on the internet, the PDF file is in an S3 bucket.

Thanks


Solution

  • The PDF file has to be uploaded to an S3 bucket. That does not mean it has to stay there forever. You could, for example, add a lifecycle rule on the bucket as a safeguard that deletes all files after 1 day, in case you run into a problem deleting the file after processing.

    The flow is asynchronous, by the way:

    • Upload the file to S3.
    • Call the Textract API to request analysis of the S3 object, providing an SNS topic where the completion notification will be posted.
    • When the notification is posted to the topic, you could pick it up by polling an SQS queue subscribed to the topic, but the best solution is to subscribe a Lambda function to the topic so that it is invoked when a message is received. The message carries the job ID and status rather than the analysis itself, so your Lambda would use the job ID to fetch the JSON results from Textract, store information as needed, and delete the object in the S3 bucket.
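
The steps above can be sketched roughly as follows. This is a minimal illustration in Python, not a definitive implementation: the function names (`start_analysis`, `fetch_blocks`, `handle_notification`) are hypothetical, the clients are assumed to be boto3 Textract and S3 clients passed in by the caller, and the topic and role ARNs are placeholders you would supply. The AWS SDK for PHP exposes the same Textract and S3 operations, so a Laravel application could follow the same shape.

```python
import json


def start_analysis(textract, bucket, key, topic_arn, role_arn):
    """Start an asynchronous Textract job for a PDF that is already in S3."""
    response = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
        # Textract publishes the completion notification to this topic,
        # using the given IAM role (both ARNs are placeholders here).
        NotificationChannel={"SNSTopicArn": topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]


def fetch_blocks(textract, job_id):
    """Collect all result blocks for a finished job, following NextToken pages."""
    blocks, token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = textract.get_document_analysis(**kwargs)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks


def handle_notification(event, textract, s3):
    """Body of a Lambda handler subscribed to the SNS topic (hypothetical name)."""
    # The SNS message is a JSON string with the job ID, status and S3 location.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    if message["Status"] != "SUCCEEDED":
        # Leave the object in place; the lifecycle rule is the safety net.
        return []
    blocks = fetch_blocks(textract, message["JobId"])
    location = message["DocumentLocation"]
    s3.delete_object(Bucket=location["S3Bucket"], Key=location["S3ObjectName"])
    # Return only the recognized text lines; adapt to whatever you need to store.
    return [b["Text"] for b in blocks if b.get("BlockType") == "LINE"]
```

In a real Lambda you would wrap `handle_notification` in the usual `handler(event, context)` entry point and pass in `boto3.client("textract")` and `boto3.client("s3")`. Passing the clients as arguments is only a design choice to keep the sketch easy to test without AWS credentials.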