
How to extract ID and Date from two substrings using Apache UIMA Ruta?


How can I extract the two numbers (ID and Date) from the following text using Ruta?

ID:1341234
Date:20191021

I tried the following:

RETAINTYPE(WS);
"ID:" n:NUM{-> CREATE(Entity, "label" = "ID", "value"=n.ct)};
"Date:" n:NUM{-> CREATE(Entity, "label" = "Date", "value"=n.ct)};
RETAINTYPE;

Thanks for your help. Philipp


Solution

  • A literal string used as the matching condition of a rule element relies on Ruta's internal indexing and matches only a single RutaBasic annotation. Whether it matches can therefore depend on all previously created annotations. For that reason I would not recommend literal string matches, except for rapid prototyping. (This applies to Ruta 2.7.0 and may change in later versions.)

    For your example, this means that the first rule element does not match: the seeder of the RutaEngine creates separate annotations for words/letters and for punctuation marks, so "ID:" spans two RutaBasic annotations instead of one.

    Your rules could work if you rewrite them like this:

    RETAINTYPE(WS);
    "ID" ":" n:NUM{-> CREATE(Entity, "label" = "ID", "value"=n.ct)};
    "Date" ":" n:NUM{-> CREATE(Entity, "label" = "Date", "value"=n.ct)};
    RETAINTYPE;
    

    or, without literal string matches:

    RETAINTYPE(WS);
    CAP.ct=="ID" COLON n:NUM{-> CREATE(Entity, "label" = "ID", "value"=n.ct)};
    W.ct=="Date" COLON n:NUM{-> CREATE(Entity, "label" = "Date", "value"=n.ct)};
    RETAINTYPE;
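
    The rules above create Entity annotations with the features "label" and "value", so that type has to exist somewhere. As a minimal sketch, assuming Entity is not already defined in an imported type system, the whole script could declare it itself with two STRING features (the declaration is my assumption, it is not part of the original question):

    ```
    // Assumption: Entity is not provided by an imported type system,
    // so it is declared here with the two features the rules fill in.
    DECLARE Annotation Entity (STRING label, STRING value);

    // Make whitespace visible for matching, as in the rules above.
    RETAINTYPE(WS);
    // "ID" is seeded as an all-caps token (CAP), ":" as COLON, the digits as NUM.
    CAP.ct=="ID" COLON n:NUM{-> CREATE(Entity, "label" = "ID", "value" = n.ct)};
    // "Date" is a word token (W), again followed by COLON and NUM.
    W.ct=="Date" COLON n:NUM{-> CREATE(Entity, "label" = "Date", "value" = n.ct)};
    // Restore the default filtering.
    RETAINTYPE;
    ```

    If Entity already comes from a type system descriptor with these features, the DECLARE line should be dropped.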
    

    DISCLAIMER: I am a developer of UIMA Ruta