uimaruta

How should I use UIMA Ruta to match all the words between line breaks?


Thanks in advance for any help!

I have some text like the following:

aaaaa aaaa aaaaa aaaaaa
bbbbb bbbbb bbbb bbbbbb
cccccc ccccc ccccc cccccc

I want to use Ruta to create an annotation that matches all the text between line breaks. The annotation should produce the following three matches:

1. aaaaa aaaa aaaaa aaaaaa
2. bbbbb bbbbb bbbb bbbbbb
3. cccccc ccccc ccccc cccccc

I tried to match everything between line breaks with the following rule:

BREAK #{-> MARK(Stuff)} BREAK;

But no luck. Could anyone please make a suggestion?

Thank you very much!


Solution

  • The problem with your rule is probably the current filtering setting. Whitespace, breaks and markup are not visible to rules by default, so the rule is probably not able to find any anchor to start the matching process. You need to make breaks visible to the rules, e.g., with RETAINTYPE:

    Document{-> RETAINTYPE(BREAK)};
    BREAK #{-> MARK(Stuff)} BREAK;
    Document{-> RETAINTYPE}; // for restoring the default setting
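
    As a complete script, a minimal sketch might look like the following. It assumes Stuff is not already defined in an imported type system, so it is declared here. Note that the rule needs a BREAK on both sides, so a first line without a preceding break, or a last line without a trailing break, would likely not be annotated:

    DECLARE Stuff; // assumption: Stuff is not defined elsewhere
    Document{-> RETAINTYPE(BREAK)};
    BREAK #{-> MARK(Stuff)} BREAK;
    Document{-> RETAINTYPE}; // restore the default filtering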
    

    There is also an analysis engine that is able to create these annotations: PlainTextAnnotator. However, its Line annotations also include whitespace at the beginning and end of the line. This can be removed with something like:

    Document{-> RETAINTYPE(SPACE)};
    Line{->TRIM(SPACE)};
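
    If PlainTextAnnotator is not already part of your pipeline, it can probably be called from the script itself. A rough sketch, assuming the descriptor and its type system are available under the utils package as in a default Ruta Workbench project (adjust the names to your setup):

    ENGINE utils.PlainTextAnnotator;      // assumed descriptor location
    TYPESYSTEM utils.PlainTextTypeSystem; // assumed type system location
    Document{-> EXEC(PlainTextAnnotator, {Line})}; // creates Line annotations
    Document{-> RETAINTYPE(SPACE)};
    Line{-> TRIM(SPACE)};
    Document{-> RETAINTYPE}; // restore the default filtering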
    

    In UIMA Ruta 2.2.1 (the next release), you can also write something like:

    Document{-> RETAINTYPE(BREAK)};
    (#{-> Stuff} BREAK)+;
    

    (I am a developer of UIMA Ruta)