How to match specific tokens in UIMA Ruta?


DECLARE A,B;
DECLARE Annotation C(Annotation firstA, Annotation secondA,...);
"token1|token2|...|tokenn" -> A;
"token3|token4" -> B;

A A B {->MARK(C,1,3)}; 

I did this with GATHER:

(A  COMMA A B) {-> GATHER(C,1,4,"firstA"=1,"secondA" = 3,"B"=4)};

But what about a sequence with an unknown number of A annotations, as below? How can all of the A annotations be stored in features when the number of features is also unknown? In plain Java, we would declare a String array and add elements to it, but Ruta does not seem to offer such a mechanism.

(A  (COMMA A)+ B) {-PARTOF(C) -> GATHER(C,beginPosition,endPosition,"firstA"=1,"secondA" = 3,"thirdA"=?,so on)};

Solution

  • Annotations in UIMA refer to the complete span, from the begin offset to the end offset. So, if you want to specify something with two elements, then a simple annotation is not sufficient. You cannot create an annotation of the type C that covers the first A and the B but not the second A.

    However, you can store the important information in feature values. How to implement it depends on various things.

    If there are always exactly two annotations you want to remember, then add two features to type C and assign the feature values in the given rule, e.g., by CREATE(C,1,3,"first" = A, "second" = B ).
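
    A minimal sketch of that two-feature case, modeled on the declarations used in the FSArray example in this answer. The feature names first and second and the literal-matching rules for A and B are illustrative assumptions, and the script has not been run here:

    // Hypothetical setup: mark the literal tokens "A" and "B".
    DECLARE A, B;
    // C remembers exactly one A and one B in typed features.
    DECLARE Annotation C (A first, B second);
    "A" -> A;
    "B" -> B;
    // Create C over rule elements 1 to 3 and fill both features.
    A A B{-> CREATE(C, 1, 3, "first" = A, "second" = B)};

    On a text like "A A B", this should produce one C annotation spanning all three tokens, with an A annotation in the feature "first" (presumably the first A inside the span) and the B annotation in the feature "second".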

    You can also use different actions like GATHER, or use one FSArray feature in order to store the annotations.

    A complete example with an FSArray:

    DECLARE A, B;
    DECLARE Annotation C (FSArray as, B b);
    "A" -> A;
    "B" -> B;
    (A (COMMA A)+ B){-PARTOF(C) -> CREATE(C, "as" = A, "b" = B)};
    

    When applied to a text like "A, A, A B", the last rule creates one annotation of type C that stores three A annotations in the feature "as" and one B annotation in the feature "b".