
UIMA: extracting semi-structured (tabular) data from text


I am working on an application that uses Apache UIMA for an NLP task: domain-specific entity extraction.

The use case is as follows:

The input is an Office document or a PDF (scanned or not), and the application needs to extract domain-specific data from it. The document may contain free text, key-value pairs, tables, and pictures.

The challenges:

Sometimes the original document contains tables (with or without metadata). Annotating a specific standalone token is not a problem. What I am looking for is an example of building relationships between annotated tokens inside a table. Say the table has a header row with business attributes, and the rows underneath contain the attribute values: I need to create the proper relationships and define groups so that I can later extract instances of information, e.g. each table row is one business entity instance composed of primitive entities and bound together by relationships.

So my questions are:

  1. I am looking for something more flexible and human-readable in terms of annotation rules. Can I use Ruta in scenarios where table-form data needs to be annotated? Any rule examples would be very helpful. My research on this topic has not turned up much so far.
  2. I am looking for an approach to extract the data when no metadata exists (see below). Would Ruta suit here, or something else? Any examples would be appreciated.
  3. I am looking for tools that simplify working with annotated text, e.g. for profiling and testing. Again, would Ruta solve this?

Examples:

  1. OCR with metadata, data after the extraction stage:
<table> 
    <tr> 
      <th>Name</th> 
      <th>Favorite Color</th> 
    </tr> 
    <tr> 
      <td>Bob</td> 
      <td>Yellow</td> 
    </tr> 
    <tr> 
      <td>Michelle</td> 
      <td>Purple</td> 
    </tr> 
</table>
  2. OCR without metadata, data after the extraction stage:
Name    Favorite Color
Bob Yellow
Michelle    Purple

Solution

  • Question 1:

    In my very subjective opinion, Ruta is very well suited for those tasks, especially if the text processing should be implemented in UIMA. There are countless ways to specify this extraction task in Ruta, depending on the available annotations and the structure of the tables. Here is an example set of rules that builds on the output of the HtmlAnnotator (actually, it's only a single rule):

    PACKAGE uima.example;
    
    TYPESYSTEM utils.HtmlTypeSystem;
    
    ENGINE utils.HtmlAnnotator;
    
    EXEC(HtmlAnnotator, {TAG});
    
    ADDRETAINTYPE(WS);
    TAG{->TRIM(WS)};
    REMOVERETAINTYPE(WS);
    
    DECLARE Relation (Annotation attribute, Annotation value);
    
    BLOCK(tables) TABLE{} {
        TR{-CONTAINS(TH)-> CREATE(Relation, "attribute" = a, "value" = v)}
            <-{# a:TD v:TD;};
    }
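
    To check what such a rule should produce on the sample table from the question, a small reference implementation outside of UIMA can help. The following is only a plain-Python analog, not Ruta and not part of the UIMA API: it assumes the rule is meant to relate the first two cells of every row that has no header cell. The names `RowCollector` and `expected_relations` are made up for this sketch.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the cell texts of every <tr> and note whether it holds a <th>."""

    def __init__(self):
        super().__init__()
        self.rows = []        # list of (has_header_cell, [cell texts])
        self._cells = None    # cells of the row currently being read
        self._has_th = False
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._has_th = [], False
        elif tag in ("td", "th") and self._cells is not None:
            self._in_cell = True
            self._cells.append("")
            if tag == "th":
                self._has_th = True

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            self.rows.append((self._has_th, [c.strip() for c in self._cells]))
            self._cells = None


def expected_relations(html):
    """Return (first cell, second cell) for every row without a header cell."""
    collector = RowCollector()
    collector.feed(html)
    collector.close()
    return [(cells[0], cells[1])
            for has_th, cells in collector.rows
            if not has_th and len(cells) >= 2]


SAMPLE = """<table>
<tr><th>Name</th><th>Favorite Color</th></tr>
<tr><td>Bob</td><td>Yellow</td></tr>
<tr><td>Michelle</td><td>Purple</td></tr>
</table>"""

relations = expected_relations(SAMPLE)
print(relations)
```

    A list like this can serve as the expected result when comparing against the `Relation` annotations the Ruta script creates.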
    

    Question 2:

    You can of course recreate the table structure of question 1 using rules and then apply the same rules. How to identify the table structure depends strongly on what you know about the tables and on what output the text converter produces, e.g., whether you know which kinds of attributes and values will occur, or whether the converter uses tabs to separate cells. Here is another example set of rules, this time building on the output of the PlainTextAnnotator:

    PACKAGE uima.example;
    
    TYPESYSTEM utils.PlainTextTypeSystem;
    
    ENGINE utils.PlainTextAnnotator;
    
    EXEC(PlainTextAnnotator, {Line});
    
    ADDRETAINTYPE(WS);
    Line{->TRIM(WS)};
    Paragraph{->TRIM(WS)};
    REMOVERETAINTYPE(WS);
    
    DECLARE Relation (Annotation attribute, Annotation value);
    DECLARE Attribute, Value;
    
    DECLARE TextTable, Row;
    DECLARE HeaderInd, HeaderLine;
    
    // mock some annotations
    "Name" -> HeaderInd;
    "Color" -> HeaderInd;
    
    Line{CONTAINS(HeaderInd, 50, 100, true)-> HeaderLine};
    
    Paragraph{STARTSWITH(HeaderLine)-> TextTable};
    TextTable->{Line{-PARTOF(HeaderLine)-> Row};};
    
    FOREACH(row) Row{}{
    
        row{CONTAINS(W,2,2)} ->{W{-> Attribute} W{-> Value};};
        row{-> CREATE(Relation, "attribute" = Attribute, "value" = Value)};
    }
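
    The grouping the question asks for (each row as one entity keyed by the header attributes) can be sketched outside of UIMA to make the target output concrete. This is a plain-Python illustration, not Ruta. It rests on assumptions that may not hold for a real converter: the first non-empty line is the header, header cells are separated by tabs or by at least two spaces, and each value is a single word. The function name `rows_as_entities` is hypothetical.

```python
import re


def rows_as_entities(text):
    """Turn a whitespace-aligned text table into one dict per data row."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    # Assumption: header cells may contain single spaces ("Favorite Color"),
    # so they are split only on tabs or on runs of two or more whitespace chars.
    headers = re.split(r"\t+|\s{2,}", lines[0])
    entities = []
    for line in lines[1:]:
        # Assumption: every value is a single word, similar in spirit to the
        # CONTAINS(W,2,2) check in the rules above.
        words = line.split()
        if len(words) == len(headers):
            entities.append(dict(zip(headers, words)))
    return entities


SAMPLE = "Name    Favorite Color\nBob Yellow\nMichelle    Purple\n"

entities = rows_as_entities(SAMPLE)
print(entities)
```

    Rows whose word count does not match the header are skipped rather than guessed at, which keeps the sketch conservative on noisy OCR output.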
    

    Question 3:

    The UIMA Ruta Workbench provides several useful IDE tools, including, among others, profiling and testing.

    DISCLAIMER: I am a developer of UIMA Ruta