amazon-web-services ocr amazon-textract

AWS Textract (OCR) not detecting some cells


I am using AWS Textract to read and parse tables from PDFs into CSV. Helpfully, AWS has documentation for this: https://docs.aws.amazon.com/textract/latest/dg/examples-export-table-csv.html

I set up the asynchronous method as they suggest, and it works for a POC. However, for some documents, some rows are missing from my CSV. After digging into the JSON produced (the issue persists if I run the document analysis from the AWS CLI), I noticed that the missing values are not referenced by any CELL block. They appear in WORD and LINE blocks, but not in a CELL block. Given how the script works, that is exactly why they are not added to my CSV.
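
One way to check for this symptom is to list the WORD blocks that no CELL block references. This is only a minimal sketch: it assumes `blocks` is the list stored under the `Blocks` key of the Textract JSON, and the function name `words_outside_cells` and the sample data are made up for illustration. Words that sit outside any table will also be listed, which is expected.

```python
def words_outside_cells(blocks):
    """Return the text of WORD blocks that no CELL block lists as a child."""
    ids_in_cells = set()
    for block in blocks:
        if block["BlockType"] == "CELL":
            # A CELL points at its words through a CHILD relationship.
            for rel in block.get("Relationships", []):
                if rel["Type"] == "CHILD":
                    ids_in_cells.update(rel["Ids"])
    return [
        block["Text"]
        for block in blocks
        if block["BlockType"] == "WORD" and block["Id"] not in ids_in_cells
    ]


# Hypothetical, heavily trimmed blocks: only the fields used above are present.
sample_blocks = [
    {"Id": "c1", "BlockType": "CELL",
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Total"},
    {"Id": "w2", "BlockType": "WORD", "Text": "42.00"},
]

print(words_outside_cells(sample_blocks))  # ['42.00']
```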

One could assume the OCR algorithm is simply not that good. But the odd thing is that if I upload the same PDF in the AWS Textract console, all the data is parsed into the table! Is anyone aware of parameters I need to set to make sure the values are detected as CELLs? Or do you think that, behind the scenes, the console simply uses a more powerful script (one that would actually use the (x, y) coordinates of each WORD to match it to the table)?

I also compared the JSON produced by the CLI with the one from the console, and they are actually different (not only the IDs: as mentioned, some values are in CELL blocks for the console but only in LINE/WORD blocks for the CLI).

Important fact: my PDF is 3 pages long. The first page works perfectly, with all the values, but on the second page roughly the first 10 rows of the table are missing. After those 10 rows, everything on that page is parsed as well.

Any suggestions? Or a script that parses the JSON more reliably?

Thank you!


Solution

  • Update: The issue was the pagination of the results. A single GetDocumentAnalysis response contains at most 1000 blocks (the MaxResults limit), according to the AWS documentation: https://docs.aws.amazon.com/textract/latest/dg/API_GetDocumentAnalysis.html#API_GetDocumentAnalysis_RequestSyntax

    If a single table spans more blocks than that, the table can reference its cells by ID in the first batch of 1000, while the referenced blocks themselves only arrive in the second batch (1001 --> 2000). So when the script tries to add such a cell to the table while processing the first batch alone, it can't resolve the reference.

    The solution is quite easy. We need to alter the GetResults function to concatenate the blocks from every response, and THEN run the other functions.

    Here is working code:

    import boto3

    # Assumption: default AWS credentials and region are configured.
    # get_table_csv_results comes from the AWS example script linked in the question.
    textract = boto3.client('textract')


    def GetResults(jobId, file_name):
        maxResults = 1000  # largest page size GetDocumentAnalysis accepts
        paginationToken = None
        blocks = []

        # Collect the blocks of every result page BEFORE parsing, so that a
        # reference made in one page can be resolved in a later one.
        while True:
            kwargs = {'JobId': jobId, 'MaxResults': maxResults}
            if paginationToken is not None:
                kwargs['NextToken'] = paginationToken
            response = textract.get_document_analysis(**kwargs)
            blocks += response['Blocks']
            paginationToken = response.get('NextToken')
            if paginationToken is None:
                break

        table_csv = get_table_csv_results(blocks)
        output_file = file_name + ".csv"
        # replace content
        with open(output_file, "w") as fout:  # Important to change "at" to "w"
            fout.write(table_csv)
        # show the results
        print('Detected Document Text')
        print('Pages: {}'.format(response['DocumentMetadata']['Pages']))
        print('OUTPUT TO CSV FILE: ', output_file)
    

    Hope this helps others.
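
The dangling-reference problem described in this answer can be reproduced without calling AWS. The sketch below is illustrative only: the helper `unresolved_child_ids` and the two fake result pages are hypothetical, and the blocks are trimmed down to the fields involved. It shows that a child ID referenced in the first page cannot be resolved until the blocks from both pages are concatenated.

```python
def unresolved_child_ids(blocks):
    """Return child IDs that are referenced but missing from `blocks`."""
    known_ids = {block["Id"] for block in blocks}
    missing = []
    for block in blocks:
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                missing.extend(i for i in rel["Ids"] if i not in known_ids)
    return missing


# Hypothetical first page: the TABLE references cell-2, which is not in this page.
page_1 = [
    {"Id": "table-1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["cell-1", "cell-2"]}]},
    {"Id": "cell-1", "BlockType": "CELL",
     "Relationships": [{"Type": "CHILD", "Ids": ["word-1"]}]},
    {"Id": "word-1", "BlockType": "WORD", "Text": "Total"},
]
# Hypothetical second page: cell-2 and its word only arrive here.
page_2 = [
    {"Id": "cell-2", "BlockType": "CELL",
     "Relationships": [{"Type": "CHILD", "Ids": ["word-2"]}]},
    {"Id": "word-2", "BlockType": "WORD", "Text": "42.00"},
]

print(unresolved_child_ids(page_1))           # ['cell-2']
print(unresolved_child_ids(page_1 + page_2))  # []
```

Running a check like this on the concatenated blocks before building the CSV is a cheap way to confirm that every page of results was actually fetched.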