
Annotate Data in between Markup


I'm trying to write a rule that detects the data between markup tags.

The input data format is fixed, for example:

<1> Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim</1>
<2>  nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim</2>

What I need is to detect the value inside the start and end tags; in my case the output should be 1 and 2.

I'm trying the rule below.

Document{->ADDRETAINTYPE(MARKUP)};

STRING sStart = "<";
STRING sEnd = ">";
DECLARE spanStart;
DECLARE spanEnd;

DECLARE ZONE;
sStart -> spanStart;
sEnd -> spanEnd;

spanStart NUM spanEnd{->MARK(ZONE,2)}; 

But no value is detected, because 1 and 2 are not recognized as NUM.


Solution

  • "1" and "2" are not detected as NUM because they are part of a MARKUP annotation. The seeding creates a disjoint, non-overlapping partitioning of the document, so the text inside a tag is covered by MARKUP only and gets no NUM annotation of its own. If you want to create an annotation within one of these smallest parts, in your case MARKUP, you can do that with a simple regex rule, similar to the ones you used in your question for spanStart and spanEnd.

    I would use something like:

    MARKUP->{"\\d+"-> ZONE;};
    

    or

    MARKUP->{"</?(\\d+)>"-> 1 = ZONE;};
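
    Putting the pieces together, a minimal sketch of a complete script might look like the following. It assumes that ZONE still has to be declared and that MARKUP still has to be retained, as in the question (MARKUP is filtered by default); those two lines come from the question, not from the rules above.

    ```
    // ZONE is the annotation type declared in the question
    DECLARE ZONE;
    // assumption: MARKUP is filtered by default, so make it visible to the rules
    Document{-> ADDRETAINTYPE(MARKUP)};
    // within each MARKUP annotation, annotate capturing group 1 (the digits) as ZONE
    MARKUP->{"</?(\\d+)>" -> 1 = ZONE;};
    ```

    Since the pattern allows an optional slash, this is expected to create a ZONE on the number in both the opening and the closing tag.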
    

    DISCLAIMER: I am a developer of UIMA Ruta