How to generalize special entities

We use Apache UIMA Ruta for processing our documents. The input documents contain all kinds of patterns that we try to recognize and translate into a hierarchy of annotations.

One of the things we will do with the result is to decorate the input text with links. For that it's important that we know the original position information of the found annotations.

Some of the annotations are based on value lists. We use MarkTable to resolve them.

The problem we have is that an input document can contain different kinds of special entities. For example, a document can contain words with entities such as &amp; or a character reference for a symbol like 𝌆. These can also occur in words or sentences that will be looked up in the value lists.

We are looking for a way to generalize (convert) all of these variants to a normal "plain text" form, so that we don't have to add every variant with special entities to the value lists.

Pre-processing the document and replacing them all (for example with the HTMLConverter engine) is AFAIK not a good option, because that will also change the position information. &amp; should match &, but should still be counted as 5 characters long.
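To illustrate the offset shift, here is a minimal Python sketch. The sample string is made up, and Python's standard html.unescape stands in for the HTMLConverter engine, so this only demonstrates the effect, not the Ruta behaviour itself:

```python
import html

# Made-up sample text containing one entity.
original = "Johnson &amp; Johnson sells Band-Aid"

# Decoding replaces the 5-character "&amp;" with a 1-character "&".
decoded = html.unescape(original)

# Everything after the entity moves 4 positions to the left.
offset_in_original = original.find("sells")
offset_in_decoded = decoded.find("sells")
shift = offset_in_original - offset_in_decoded
```

Any annotation found after the entity in the decoded text would therefore point 4 characters too early in the original document.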

I tried to use the replace action, which adds an extra "replacement" attribute to the annotation. When I add an interceptor (aspect) to getCoveredText of the annotation class and return the replacement instead of the real text if available, the matching succeeds. But this gives problems if the replacement text contains spaces (the end position is still equal to that of the original text / first RutaBasic).

Any suggestions on how we can solve this?

Solution

I solved this issue by building a pre-processor and a post-processor for the content.

In the pre-processor I replace text fragments with other text. For example, &amp; and &AMP; are replaced by a normal &. While pre-processing I store the details of each replacement in a replacement object, which is added to an ordered list. A replacement object contains the original text and the difference in length (&amp; is 4 characters longer than a single &).
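A minimal sketch of such a pre-processor, in Python for brevity. The names Replacement, ENTITY_MAP and preprocess are hypothetical, and storing the position of each replacement in the normalized text is an assumption on top of the two fields described above:

```python
import re
from dataclasses import dataclass


@dataclass
class Replacement:
    position: int  # assumption: offset of the replacement in the normalized text
    original: str  # original fragment, e.g. "&amp;"
    delta: int     # how much longer the original is than its replacement


# Hypothetical mapping; a real pipeline would cover more entities.
ENTITY_MAP = {"&amp;": "&", "&AMP;": "&"}
PATTERN = re.compile("|".join(re.escape(key) for key in ENTITY_MAP))


def preprocess(text):
    """Return the normalized text and the ordered list of replacements."""
    parts = []
    replacements = []
    out_len = 0  # length of the normalized text built so far
    last = 0     # end of the previous match in the original text
    for match in PATTERN.finditer(text):
        unchanged = text[last:match.start()]
        parts.append(unchanged)
        out_len += len(unchanged)
        plain = ENTITY_MAP[match.group()]
        replacements.append(
            Replacement(out_len, match.group(), len(match.group()) - len(plain))
        )
        parts.append(plain)
        out_len += len(plain)
        last = match.end()
    parts.append(text[last:])
    return "".join(parts), replacements


normalized, replacements = preprocess("A &amp; B &AMP; C")
```

Because the matches are visited from left to right, the list is already ordered by position, which is what the correction step relies on.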

After annotating with Ruta (and other annotators) I restore the text of every found annotation to its original value and fix the position information (begin and end) of the annotations, so that they match the original content. I use the list of replacement details for this step.
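The offset correction could look roughly like the following Python sketch. It is an illustration under assumptions, not the actual implementation: to_original is a hypothetical helper, and each replacement is written here as a plain tuple (position in the normalized text, original fragment, replacement text), ordered by position:

```python
def to_original(offset, replacements, is_end=False):
    """Map an offset in the normalized text back to the original text."""
    shift = 0
    for position, original, plain in replacements:
        if position + len(plain) <= offset:
            # The replacement lies entirely before the offset.
            shift += len(original) - len(plain)
        elif position < offset:
            # The offset falls inside replaced text: snap to the fragment
            # start for a begin offset, or to its end for an end offset.
            return position + shift + (len(original) if is_end else 0)
        else:
            break  # all remaining replacements start at or after the offset
    return offset + shift


# Made-up example data.
original_text = "A &amp; B &AMP; C"
normalized_text = "A & B & C"
replacements = [(2, "&amp;", "&"), (6, "&AMP;", "&")]

# An annotation covering "B & C" in the normalized text (end is exclusive).
begin, end = 4, 9
orig_begin = to_original(begin, replacements)
orig_end = to_original(end, replacements, is_end=True)
covered = original_text[orig_begin:orig_end]
```

With both offsets mapped back, the covered text can be read directly from the original document, which restores the original annotation value at the same time.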