python, amazon-web-services, csv, amazon-textract

AWS Textract table extraction splits integers that contain a comma into another column


I would like to use AWS Textract in Python to extract the tables from an image and save them as a CSV file.

I followed the documentation and example code from AWS here: https://github.com/awsdocs/aws-doc-sdk-examples/blob/master/python/example_code/textract/textract_python_table_parser.py

However, the code in the link above splits integers that contain commas across two columns. The image and steps below reproduce the problem:

This is an example of my table in image form: [image]

To reproduce the error, clone the GitHub repo and run the following command in your cmd/terminal:

python textract_python_table_parser.py <your-image-filename.png>

The resulting output is shown below: [image]

As you can see in the ["Amount (USD)"] column, values that contain a comma are split, and the part after the comma spills into the ["Transaction Date"] column. Reading the CSV file with pandas did not help either.
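The spill-over can be reproduced without Textract. The following is a minimal sketch using Python's standard `csv` module; the row values are made up for illustration and are not taken from the table in the image. It shows how one unquoted line is read as three fields, while the same line with the amount quoted is read as two.

```python
import csv
import io

# Hypothetical row: an amount of 1,250 followed by a transaction date.
unquoted_line = "1,250,2021-01-05\n"
quoted_line = '"1,250",2021-01-05\n'

# Without quotes, the comma inside the amount is treated as a column separator.
unquoted_fields = next(csv.reader(io.StringIO(unquoted_line)))

# With quotes, the comma stays inside the amount field.
quoted_fields = next(csv.reader(io.StringIO(quoted_line)))

print(unquoted_fields)
print(quoted_fields)
```

Any CSV reader, pandas included, sees the unquoted line the same way, which is why loading the file with pandas does not repair the columns.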

Which line of code in the GitHub repo causes values with a comma to be split into another column?


Solution

  • In the linked GitHub file, on line 114, wrap the curly-bracket placeholder in double quotes:

    csv += '"{}"'.format(text) + ","
    

    The reason is that wrapping every cell in double quotes makes it a quoted CSV field, so a CSV reader treats commas inside the cell as part of the value instead of as column separators.
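The one-line fix above works as long as the cell text itself contains no double quotes. As an alternative, here is a minimal sketch that lets Python's standard `csv` module do the quoting and escaping. The function name `rows_to_csv` and the sample rows are hypothetical and are not part of the AWS example script.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows (lists of strings) to CSV text, quoting every cell."""
    buffer = io.StringIO()
    # QUOTE_ALL wraps every cell in double quotes; embedded quotes are doubled.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical table content, standing in for the text extracted from each cell.
rows = [
    ["Amount (USD)", "Transaction Date"],
    ["1,250", "2021-01-05"],
    ['12" cable', "2021-01-06"],
]

csv_text = rows_to_csv(rows)

# Reading the text back gives the original cells, commas and quotes included.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(csv_text)
```

Building the rows as lists and writing them once at the end avoids hand-assembling the CSV string, at the cost of restructuring the script slightly.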