How to retrieve compound words from a string list in UIMA Ruta


Sample Script:

DECLARE Name,TEST;

 "Peter"->Name;
 "der Groot"->Name;
 "Robert"->Name;
 "de Leew"->Name;
 "O'Sullivan"->Name;

STRING s;
STRINGLIST slist;
Name{-> MATCHEDTEXT(s), ADD(slist,s),LOG(s)};
  ANY+ {INLIST(slist)->MARK(TEST)};

Received Output:

Peter
Robert

Expected Output:

 Peter
 der Groot
 Robert
 de Leew
 O'Sullivan

Sample Input:

Peter
der Groot
Robert
de Leew
O'Sullivan

I tried to mark every value in the string list with an annotation type, but the received output differs from the expected output.


Solution

  • The condition on the rule element ANY+ is evaluated for each single ANY separately, so matching fails at the first ANY that is not in the list, and only single-token entries can ever match.

    Should the last rule annotate only positions directly after Name annotations?

    If not, then you can do something like:

    Name{-> MATCHEDTEXT(s), ADD(slist,s)};
    MARKFAST(TEST, slist);
    

    If yes, the situation gets more complicated because you do not have candidates with the correct span. You cannot solve this with a combination of ANY and INLIST: you need either a correct span or fragments in the list. I'd rather recommend an additional fixing rule:

    Name{-> MATCHEDTEXT(s), ADD(slist,s)};
    MARKFAST(TEST, slist);
    ANY{-ENDSWITH(Name)} @TEST{-> UNMARK(TEST)};
    

    DISCLAIMER: I am a developer of UIMA Ruta
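
To see why the script in the question only marks Peter and Robert, the two matching strategies can be modelled outside Ruta. The following is a rough Python sketch, not Ruta code: the `tokens` helper is a hypothetical stand-in for Ruta's basic tokens (it assumes words and single punctuation marks are separate tokens), `per_token_matches` imitates checking a list condition on each token individually, and `span_matches` imitates a dictionary-style lookup of whole entries. Plain substring search is used for simplicity, which ignores token boundaries.

```python
import re

names = ["Peter", "der Groot", "Robert", "de Leew", "O'Sullivan"]
text = "\n".join(names)  # same content as the sample input


def tokens(s):
    # Assumption: a token is a run of word characters or one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", s)


def per_token_matches(text, entries):
    # Each token is tested on its own, so multi-token entries never match.
    return [t for t in tokens(text) if t in entries]


def span_matches(text, entries):
    # Each list entry is searched as a whole string, so it can span tokens.
    found = []
    for entry in entries:
        for m in re.finditer(re.escape(entry), text):
            found.append((m.start(), m.group()))
    return [matched for _, matched in sorted(found)]


if __name__ == "__main__":
    print(per_token_matches(text, names))
    print(span_matches(text, names))
```

In this model the per-token check returns only Peter and Robert, because "der Groot", "de Leew" and "O'Sullivan" each split into several tokens that are not list entries on their own, while the whole-entry lookup returns all five names.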