
Removing markup from an annotation in UIMA Ruta


I use the P tag (from the Html Annotator) as PASSAGE, and I want to exclude the markup from that annotation.

SCRIPT:

//-------------------------------------------------------------------
// SPECIAL SQUARE HYPHEN PARENTHESIS
//-------------------------------------------------------------------
DECLARE LParen, RParen;
SPECIAL{REGEXP("[(]") -> MARK(LParen)};
SPECIAL{REGEXP("[)]") -> MARK(RParen)};

DECLARE LSQParen, RSQParen;
SPECIAL{REGEXP("[\\[]") -> MARK(LSQParen)};
SPECIAL{REGEXP("[\\]]") -> MARK(RSQParen)};

DECLARE LANGLEBRACKET,RANGLEBRACKET;
SPECIAL{REGEXP("<")->MARK(LANGLEBRACKET)};
AMP{REGEXP("&lt;")->MARK(LANGLEBRACKET)};
SPECIAL{REGEXP(">")->MARK(RANGLEBRACKET)};
AMP{REGEXP("&gt;")->MARK(RANGLEBRACKET)};

DECLARE LBracket,RBracket;

(LParen|LSQParen|LANGLEBRACKET){->MARK(LBracket)};
(RParen|RSQParen|RANGLEBRACKET){->MARK(RBracket)};


DECLARE PASSAGE,TESTPASSAGE;

"<a name=\"para(.+?)\">(.*?)</a>"->2=PASSAGE;

RETAINTYPE(WS); // or RETAINTYPE(SPACE, BREAK,...);
PASSAGE{-> TRIM(WS)};
RETAINTYPE;

PASSAGE{->MARK(TESTPASSAGE)};



DECLARE TagContent, PassageFirstToken, InitialTag;
LBracket ANY+? RBracket{-PARTOF(TagContent)->MARK(TagContent,1,3)};

BLOCK(foreach) PASSAGE{} {
    Document{->MARKFIRST(PassageFirstToken)};
}
TagContent{CONTAINS(PassageFirstToken),-PARTOF(InitialTag)->MARK(InitialTag)};

BLOCK(foreach) PASSAGE{} {
    InitialTag ANY+{->SHIFT(PASSAGE,2,2)};
}

Sample Input:

<p class="Normal"><a name="para1"><h1><b>On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document. </b></a></p>

<p class="Normal"><a name="para2"><aus>On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document.</a></p>

<p class="Normal"><a name="para3">On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document.</a></p>

<p class="Normal"><a name="para4">On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document. </a></p>

<p class="Normal"><a name="para5">On the Insert tab, the <span>galleries</span> include items that are designed to coordinate with the overall look of your document.</a></p>

I get 5 PASSAGE annotations but only 2 TESTPASSAGE annotations. Why are there fewer TESTPASSAGE annotations? Also, InitialTag is not annotated at all.

I have attached an image of the output annotations.


Solution

  • When reproducing the given example, I get 5 PASSAGE annotations and 3 TESTPASSAGE annotations (the last three PASSAGE annotations). The other two PASSAGE annotations are not annotated with TESTPASSAGE because they start with a MARKUP annotation. MARKUP is not visible by default, which makes the complete annotation invisible to the rule. To avoid this problem, you can either make MARKUP visible or trim the markup from the PASSAGE annotations (is this actually the main question?). Just extend your rules for the TRIM action:

    RETAINTYPE(WS, MARKUP);
    PASSAGE{-> TRIM(WS, MARKUP)};
    RETAINTYPE;
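
    The other option mentioned above, making MARKUP visible instead of trimming it, might look like the following minimal sketch. It assumes the PASSAGE and TESTPASSAGE types declared in the question and only widens the filter for the one rule that creates TESTPASSAGE. In this variant the PASSAGE annotations keep their leading and trailing tags.

    ```
    // Sketch: make MARKUP visible only while TESTPASSAGE is created.
    RETAINTYPE(MARKUP);
    PASSAGE{-> MARK(TESTPASSAGE)};
    RETAINTYPE; // restore the default filtering
    ```

    If the goal is to drop the tags from the passage itself, the TRIM variant is the better fit.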
    

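    There are no InitialTag annotations because there are no TagContent annotations, and there are no TagContent annotations because the example produces no LBracket annotations. (Presumably the angle brackets in the sample input belong to MARKUP tokens, so no SPECIAL token matches "<" or ">".)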

    Btw, you could rewrite some rules:

    PASSAGE{->MARKFIRST(PassageFirstToken)};
    
    (LBracket # RBracket){-PARTOF(TagContent)-> TagContent}; 
    

    DISCLAIMER: I am a developer of UIMA Ruta