
How can we annotate a Unicode character in UIMA Ruta?


How can we annotate a Unicode character in UIMA Ruta? For example, I want to mark this text: Paris: Éditions Robert Laffont. I used the following rule:

DECLARE CITY;
CW COLON CW+{->MARK(CITY,1,3)};

But the annotation only covers the text up to Paris: Ã. Is there any way to solve this problem? Thanks in advance.


Solution

  • It's all about the definition of the lexer, which creates the token class annotations of Ruta (W, CW, SPECIAL, ...).

    The rule CW COLON CW+{->MARK(CITY,1,1)}; creates an annotation of the type CITY for the text span Paris regardless of the unicode character.

    The last rule element CW+ matches on Ã, since that character is annotated as a CW, but it stops there because the character that follows is not a CW but a SPECIAL.

    There are different ways to avoid this problem. My advice would be to rely on a different type of annotation for your rules. The job of Ruta's lexer annotations is to create minimal annotations; they do not define tokens in general.

    You could maybe use something like this (or use an actual tokenizer for better performance):

    DECLARE CITY;
    DECLARE Token;
    
    RETAINTYPE(SPACE);
    (W (SPECIAL? W)*){-> Token};
    RETAINTYPE;
    
    Token COLON Token+{->MARK(CITY,1,1)};
    

    DISCLAIMER: I am a developer of UIMA Ruta
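
A side note on where the Ã might come from. The thread does not confirm this, but the output "Paris: Ã" looks like what you would get if the UTF-8 bytes of É were decoded with a single-byte charset before the lexer ran. The following is a minimal Python sketch under that assumption; the helper `misdecode` and the choice of cp1252 as the wrong charset are hypothetical and only serve the illustration. It shows that the first garbled character is an uppercase letter while the second is punctuation, which would fit the CW followed by SPECIAL behaviour described above.

```python
import unicodedata


def misdecode(text: str, wrong_charset: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes with a different charset.

    Hypothetical helper: it only simulates a document that was read with
    the wrong encoding. cp1252 is an assumed example, not taken from the thread.
    """
    return text.encode("utf-8").decode(wrong_charset)


original = "Paris: Éditions Robert Laffont"
garbled = misdecode(original)
print(garbled)

# "É" is two bytes in UTF-8, so it turns into two characters when misdecoded.
first, second = misdecode("É")
for ch in (first, second):
    # Print the character, its Unicode name, and its general category
    # ("Lu" = uppercase letter, "Po" = other punctuation).
    print(repr(ch), unicodedata.name(ch), unicodedata.category(ch))
```

If this is what happened, making sure the document is read as UTF-8 before the script runs would be worth checking in addition to the Token-based rules above.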