Tags: amazon-web-services, amazon-cloudsearch, amazon-textract

AWS: Textract with Cloudsearch


I am in the process of creating a project that uses some AWS services (for training purposes).

As you all know, AWS has a lot of different services, and it takes some knowledge to understand how to use them and what you can use them for. That is why I am posting the question here:

My idea

I want to create an application where users can upload their PDFs / images, process them with AWS Textract, and then search their documents in a smart way.

Now the trick here is:

  • Not all documents are structured the same way
  • Each user of the application has their own documents (which should be private)

So after reading a lot of documentation, here is the solution I came up with using AWS Textract and AWS CloudSearch:

A client uploads their documents to my service. Each document is then saved and processed by AWS Textract, and the output is stored in a database.

Searching

Now, this is where I am in doubt. I want each user to be able to search their own private documents. I've been looking at CloudSearch; however, I am not 100% sure of its capabilities when documents are so different and unique.

So I guess my question is what is the best way to search these unique documents?


Solution

  • There are some things to note before I can even attempt to answer your question:

    So I guess my question is what is the best way to search these unique documents?

    First, Textract gives a JSON output in the following format:

    {
      "BlockType": "LINE",
      "Confidence": 99.71240997314453,
      "Text": "Previous Employer: None",
      "Geometry": {…}
    }
    

    This means that the field CloudSearch would see is "Text". It wouldn't see "Previous Employer" as a field, just as part of a text string. (I'll come back to this.)
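To make that concrete, here is a minimal sketch of pulling the LINE text out of a Textract response in Python. It assumes the response is already loaded as a dict with a top-level "Blocks" list; the function name `extract_lines` and the sample data are made up for illustration.

```python
def extract_lines(response):
    """Return (text, confidence) for every LINE block in a Textract response dict."""
    lines = []
    for block in response.get("Blocks", []):
        # Textract also emits PAGE and WORD blocks; only LINE blocks are kept here.
        if block.get("BlockType") == "LINE":
            lines.append((block.get("Text", ""), block.get("Confidence", 0.0)))
    return lines


# Hypothetical, heavily trimmed response used only to exercise the function.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Confidence": 99.71240997314453,
         "Text": "Previous Employer: None"},
        {"BlockType": "WORD", "Confidence": 99.8, "Text": "Previous"},
    ]
}

lines = extract_lines(sample_response)
```

Note that the key/value structure of the form ("Previous Employer" as a key, "None" as a value) is not recovered here; each line is just a string.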

    You should also note that Textract, like any other OCR engine, is not perfect, so you should expect a certain number of errors. For example, the text field above may be read as

    "Text": "Previous Employee: None"
    

    It's also possible that it could split or concatenate things in ways you wouldn't expect. So you could end up with two separate LINE blocks with "Previous Employer:" and "None".

    With regard to privacy, a single CloudSearch domain has no concept of person A's data or person B's data. There is just data.

    There are certain things you could do in Lambda or API Gateway to limit queries or the results for certain fields, but that wouldn't hide person A's data from person B. Plus, because we are using Textract, everything will be under the field "Text" anyway.

    Your only alternative would be to have a separate CloudSearch domain for each user. That is a genuine possibility as long as you have 100 or fewer customers and you don't have other CloudSearch domains for other things, because AWS limits the number of domains per account to 100. See Understanding Amazon CloudSearch Limits for further info.

    While we are on the subject of CloudSearch limits, each document can only have 200 fields. Even a small document can lead to a Textract JSON response with over 200 fields. A single LINE block has 18 fields because of all of the fields under "Geometry". So you would probably need to iterate over the data generated by Textract and strip out the fields that are of no use to you before uploading it to CloudSearch.
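The stripping step could look roughly like the sketch below. Which keys are worth keeping is an assumption on my part (here the text, the confidence, the page number and the bounding box); `KEEP_KEYS` and `slim_block` are hypothetical names, not anything Textract or CloudSearch provides.

```python
# Keys kept as-is from each Textract block; everything else is dropped.
KEEP_KEYS = ("BlockType", "Text", "Confidence", "Page")


def slim_block(block):
    """Return a reduced copy of a Textract block with far fewer fields."""
    slim = {key: block[key] for key in KEEP_KEYS if key in block}
    # Keep the bounding box (4 numbers) and drop the polygon (8 numbers).
    box = block.get("Geometry", {}).get("BoundingBox")
    if box:
        slim["BoundingBox"] = box
    return slim


# Hypothetical block used only to show the effect.
sample_block = {
    "BlockType": "LINE",
    "Confidence": 99.71240997314453,
    "Text": "Previous Employer: None",
    "Id": "example-id",
    "Geometry": {
        "BoundingBox": {"Width": 0.3, "Height": 0.02, "Left": 0.1, "Top": 0.2},
        "Polygon": [{"X": 0.1, "Y": 0.2}, {"X": 0.4, "Y": 0.2},
                    {"X": 0.4, "Y": 0.22}, {"X": 0.1, "Y": 0.22}],
    },
}

slimmed = slim_block(sample_block)
```

If you want highlighting that follows rotated text exactly, you would keep the polygon instead of the bounding box; that is a trade-off between field count and precision.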

    So to the actual search. You will only really have one field to search, which is "Text". This means that you won't be able to index any of the documents uploaded, which will have an effect on query response times.

    You could do the following types of requests though:

    1. Search by field (match):

      q=Previous Employer: None&q.options={fields: ['Text']}

    2. Search by field (contains):

      (phrase field=Text 'Previous Employer')

    3. Search by free text (word):

      q=Employer

    4. Search by free text (phrase):

      q="Previous Employer"

    5. Sloppy search (distance between words):

      q="Previous None"~2

    There are more examples of text searches here, but the point I'm trying to make is that you will only be able to do text searches.
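As a rough illustration, the query types above can be turned into search request URLs like this. This is only a sketch: it assumes the 2013-01-01 search API path, and it assumes the Textract text was indexed in a field named `text` (a hypothetical name; as far as I know CloudSearch index field names have to be lowercase). `search_url` is a made-up helper.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_PATH = "/2013-01-01/search"


def search_url(query, parser="simple", fields=None):
    """Build the path and query string for a CloudSearch search request."""
    params = {"q": query, "q.parser": parser}
    if fields:
        # q.options is passed as a JSON object.
        params["q.options"] = json.dumps({"fields": fields})
    # urlencode takes care of escaping quotes, spaces and the ~ operator.
    return SEARCH_PATH + "?" + urlencode(params)


word_url = search_url("Employer")
phrase_url = search_url('"Previous Employer"')
sloppy_url = search_url('"Previous None"~2', fields=["text"])
structured_url = search_url("(phrase field=text 'Previous Employer')",
                            parser="structured")

# Round-trip one URL to check that nothing was mangled by the encoding.
parsed = parse_qs(urlsplit(sloppy_url).query)
```

The resulting path is appended to your domain's search endpoint. The structured form needs `q.parser=structured`; the other three use the default simple parser.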

    But again the structure of the data from Textract is an issue, because all of these queries would return the same thing:

    {
      "BlockType": "LINE",
      "Confidence": 99.71240997314453,
      "Text": "Previous Employer: None",
      "Geometry": {…}
    }
    

    It wouldn't return the next line in the document, because that exists with the same field names and would technically be a separate document.

    A lot of these issues could be resolved if you took the Textract results and created a brand new document from them (you would need to give some thought to exactly how you would manage this workflow).

    Potential document format:

    {
      "id": "somethingUnique",
      "fields": {
        "title": "FileName",
        "user": ["UserId"],
        "wholeText": "Giant concatenated string of all text from Textract",
        "page1": ["jsonString1", "jsonString2", "etc"],
        "page2": ["jsonString1", "jsonString2", "etc"],
        "originalDocImage": [bytes]
      }
    }
    

    Here each JSON string is a single block from the Textract results (or just its Text and Geometry).
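A minimal sketch of building a document in that shape from a Textract response might look like this. The function name `build_search_document` and the sample data are made up, the "Page" key is assumed to be present on each block (it defaults to 1 here if it is missing), and the original image is left out to keep the example short.

```python
import json


def build_search_document(doc_id, user_id, file_name, response):
    """Collapse a Textract response into one searchable document per upload."""
    pages = {}
    texts = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        text = block.get("Text", "")
        texts.append(text)
        # One field per page, each holding JSON strings of Text plus Geometry.
        page_key = "page{}".format(block.get("Page", 1))
        entry = {"Text": text, "Geometry": block.get("Geometry", {})}
        pages.setdefault(page_key, []).append(json.dumps(entry))
    fields = {
        "title": file_name,
        "user": [user_id],
        "wholeText": " ".join(texts),
    }
    fields.update(pages)
    return {"id": doc_id, "fields": fields}


# Hypothetical two-page response used only to exercise the function.
sample = {
    "Blocks": [
        {"BlockType": "LINE", "Text": "Previous Employer: None", "Page": 1,
         "Geometry": {}},
        {"BlockType": "WORD", "Text": "Previous", "Page": 1},
        {"BlockType": "LINE", "Text": "References: Two", "Page": 2,
         "Geometry": {}},
    ]
}

doc = build_search_document("somethingUnique", "UserId", "FileName", sample)
```

When uploading to CloudSearch, each document in the batch also needs a "type" of "add" next to "id" and "fields", and every field used here has to be defined in the domain's indexing options first.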

    That way you could restrict searches by user ID, return the user's whole document, and overlay it with highlighting using the Geometry data.

    It also allows for indexing on the user field and lets users search by the name of the file they uploaded (which could be another index field).

    There is a slight pain point: searches for multiple words that may not be next to each other would still require the giant string of all the text found, and you would then need to iterate over the JSON strings to get the Geometry data for the highlighting.
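That iteration could be sketched as follows, assuming each page field holds JSON strings with "Text" and "Geometry" keys as in the format above. `find_highlight_boxes` is a hypothetical helper and the matching is a plain case-insensitive substring check, which will not cope with OCR misreads.

```python
import json


def find_highlight_boxes(page_json_strings, terms):
    """Return the bounding boxes of blocks whose text contains any search term."""
    wanted = [term.lower() for term in terms]
    boxes = []
    for raw in page_json_strings:
        block = json.loads(raw)
        text = block.get("Text", "").lower()
        if any(term in text for term in wanted):
            boxes.append(block.get("Geometry", {}).get("BoundingBox"))
    return boxes


# Hypothetical page field with two stored blocks.
page1 = [
    json.dumps({"Text": "Previous Employer: None",
                "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.2,
                                             "Width": 0.3, "Height": 0.02}}}),
    json.dumps({"Text": "Date of Birth",
                "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.3,
                                             "Width": 0.2, "Height": 0.02}}}),
]

boxes = find_highlight_boxes(page1, ["employer"])
```

The box values are ratios of the page width and height, so they have to be multiplied by the rendered image size before drawing an overlay.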

    This leaves a lot of the search processing outside of CloudSearch, and all searches would have to be proxied through something like API Gateway/Lambda to ensure people can't search data that is not their own.
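In the proxy, the per-user restriction could be sketched like this. It assumes `user` is a literal index field as in the document format above, and that the user ID comes from the authenticated request context rather than from anything the client sends. The helper names are hypothetical; the returned keys follow the keyword arguments of the boto3 `cloudsearchdomain` client's `search` method.

```python
def escape_structured(value):
    """Escape backslashes and single quotes for a structured-syntax string."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def user_scoped_search_args(user_id, user_query, size=10):
    """Build search arguments that always filter on the caller's user ID."""
    return {
        "query": user_query,         # free text typed by the user
        "queryParser": "simple",
        # The filter is added server side, so the client cannot remove it.
        "filterQuery": "user:'{}'".format(escape_structured(user_id)),
        "size": size,
    }


args = user_scoped_search_args("user-123", "Employer")
quoted = user_scoped_search_args("o'brien", "Employer")
```

Inside a Lambda handler these arguments would be passed on with something like `client.search(**args)`, where `client` is a `cloudsearchdomain` client created for your domain's search endpoint.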