I am working on an application that uses Apache UIMA for an NLP task: domain-specific entity extraction.
The use case is as follows:
The input is an Office document or a PDF (scanned or non-scanned), and the application needs to extract domain-specific data from it. The document may contain free text and/or key-value pairs, tables, and pictures.
The challenges:
Sometimes the original document contains tables (with or without metadata). Annotating a specific standalone token is not a problem. However, I am looking for an example of how to build relationships between annotated tokens inside a table. Say the table has headers with some business attributes and the rows underneath contain the attribute values: I need to create the proper relationships and define groups so that I can later extract instances of information. For example, each row of the table is one business entity instance, composed of some primitive entities and bound together by relationships.
So my questions are:
Examples:
<table>
<tr>
<th>Name</th>
<th>Favorite Color</th>
</tr>
<tr>
<td>Bob</td>
<td>Yellow</td>
</tr>
<tr>
<td>Michelle</td>
<td>Purple</td>
</tr>
</table>
Name Favorite Color
Bob Yellow
Michelle Purple
Question 1:
In my very subjective opinion, Ruta is very well suited for these tasks, especially if the text processing is to be implemented in UIMA. There are countless ways to specify this extraction task in Ruta, depending on the available annotations and the structure of the tables. Here is an example set of rules that builds upon the output of the HtmlAnnotator (actually, it's only a single rule):
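PACKAGE uima.example;
TYPESYSTEM utils.HtmlTypeSystem;
ENGINE utils.HtmlAnnotator;
// run the HTML annotator to obtain the tag annotations
EXEC(HtmlAnnotator, {TAG});
// make whitespace visible, trim it from the tag boundaries, then hide it again
ADDRETAINTYPE(WS);
TAG{->TRIM(WS)};
REMOVERETAINTYPE(WS);
DECLARE Relation (Annotation attribute, Annotation value);
BLOCK(tables) TABLE{} {
    // for each row without a header cell, relate two consecutive cells
    TR{-CONTAINS(TH)-> CREATE(Relation, "attribute" = a, "value" = v)}
        <-{# a:TD v:TD;};
}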
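To make the effect of the rule easier to see outside of Ruta, here is a minimal sketch in plain Python (not Ruta, and not part of UIMA) of the same pairing logic for the two-column example table. The names `RowPairs` and `extract_html_pairs` are hypothetical, and taking only the first two cells of a row is a simplifying assumption for this example.

```python
from html.parser import HTMLParser


class RowPairs(HTMLParser):
    """Collects (first cell, second cell) pairs from rows without a <th>."""

    def __init__(self):
        super().__init__()
        self.pairs = []        # one tuple per data row
        self._cells = None     # <td> texts of the current row, None outside <tr>
        self._has_th = False   # True if the current row contains a header cell
        self._in_td = False    # True while reading text inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._has_th = [], False
        elif tag == "th":
            self._has_th = True
        elif tag == "td" and self._cells is not None:
            self._cells.append("")
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._cells is not None:
            # Rough analogue of TR{-CONTAINS(TH)}: header rows are skipped.
            # Assumption: only the first two cells of a row are paired.
            if not self._has_th and len(self._cells) >= 2:
                self.pairs.append((self._cells[0].strip(), self._cells[1].strip()))
            self._cells = None

    def handle_data(self, data):
        if self._in_td:
            self._cells[-1] += data


def extract_html_pairs(html):
    """Return the (attribute, value) pairs found in the given HTML string."""
    parser = RowPairs()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Applied to the example table, each returned tuple plays the role of one `Relation` annotation, with the first cell as `attribute` and the second as `value`.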
Question 2:
You can of course recreate the table structure of question 1 using rules and then apply the same rules. Identifying the table structure strongly depends on the information you have about the tables and on what output the text converter produces, e.g., do you know what kinds of attributes/values will occur, or does the converter use tabs to separate cells? Here's again an example set of rules, this time building upon the output of the PlainTextAnnotator:
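PACKAGE uima.example;
TYPESYSTEM utils.PlainTextTypeSystem;
ENGINE utils.PlainTextAnnotator;
// run the plain text annotator to obtain the Line annotations
EXEC(PlainTextAnnotator, {Line});
// make whitespace visible, trim it from lines and paragraphs, then hide it again
ADDRETAINTYPE(WS);
Line{->TRIM(WS)};
Paragraph{->TRIM(WS)};
REMOVERETAINTYPE(WS);
DECLARE Relation (Annotation attribute, Annotation value);
DECLARE Attribute, Value;
DECLARE TextTable, Row;
DECLARE HeaderInd, HeaderLine;
// mock some annotations
"Name" -> HeaderInd;
"Color" -> HeaderInd;
// a line is a header line if 50-100% of it consists of header indicators
Line{CONTAINS(HeaderInd, 50, 100, true)-> HeaderLine};
// a paragraph that starts with a header line is treated as a table
Paragraph{STARTSWITH(HeaderLine)-> TextTable};
// every other line of the table is a row
TextTable->{Line{-PARTOF(HeaderLine)-> Row};};
FOREACH(row) Row{}{
    // in rows with exactly two words, the first is the attribute, the second the value
    row{CONTAINS(W,2,2)} ->{W{-> Attribute} W{-> Value};};
    row{-> CREATE(Relation, "attribute" = Attribute, "value" = Value)};
}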
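The same steps can be sketched in plain Python (again only an illustration, not Ruta) to show what the rules compute on the plain-text example. The names `HEADER_WORDS`, `is_header` and `extract_text_pairs` are hypothetical, and the sketch assumes that paragraphs are separated by blank lines and that words are separated by whitespace.

```python
import re

# Same mocked header indicators as in the rules above.
HEADER_WORDS = {"Name", "Color"}


def is_header(line, min_ratio=0.5):
    """True if at least min_ratio of the words in the line are header indicators."""
    words = line.split()
    if not words:
        return False
    hits = sum(1 for word in words if word in HEADER_WORDS)
    return hits / len(words) >= min_ratio


def extract_text_pairs(text):
    """Return (attribute, value) pairs from paragraphs that start with a header line."""
    pairs = []
    # Assumption: a paragraph is a run of lines delimited by blank lines.
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        if not lines or not is_header(lines[0]):
            continue  # not a table: the paragraph does not start with a header line
        for line in lines[1:]:
            if is_header(line):
                continue  # header lines are not rows
            words = line.split()
            if len(words) == 2:  # only rows with exactly two words are paired
                pairs.append((words[0], words[1]))
    return pairs
```

For the three-line example, the first line qualifies as a header because two of its three words are indicators, and the two remaining lines each yield one pair. Multi-word cell values would need a real cell separator such as tabs, which is exactly the converter-dependent information mentioned above.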
Question 3:
The UIMA Ruta Workbench provides several useful IDE tools, including profiling and testing, among others.
DISCLAIMER: I am a developer of UIMA Ruta