I am trying to extract the number shown in bold below (AN A348645 PLB) with a Ruta script. Please see the examples I have provided.
Below is my code:
Document{->RETAINTYPE(SPACE)};
((W|NUM) (NUM|W|SPACE|SPECIAL)*){REGEXP("([1]{0,1}[A-Z0-9]{2}[\\s ||-]{0,2}[A-Z0-9]{7}[\\s ||-]{0,2}[A-Z]{3})")->MARK(EntityType)};
1)
Input: Claims Experience Report - AN A348645 PLB Nest Holdings Pty Ltd
Expected output: AN A348645 PLB
Actual output: no entity is matched
However, it works when no word or letter follows the pattern:
2)
Input: Claims Experience Report - AN A348645 PLB
Expected Output: AN A348645 PLB
Actual output: AN A348645 PLB
In the example AN A348645 PLB Nest Holdings Pty Ltd, the greedy star quantifier * keeps consuming the annotations that follow PLB, and the rule then tries to match that longer span against the given regexp pattern. Therefore, the rule fires only when no further tokens follow the pattern.
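The effect can be reproduced outside Ruta with plain Java regex. This is only a sketch, assuming the REGEXP condition has to match the entire text covered by the rule element, which is consistent with the behaviour described in the question. The class name GreedySpanDemo and the method coversWholeSpan are made up for illustration and are not part of Ruta.

```java
import java.util.regex.Pattern;

class GreedySpanDemo {
    // Pattern copied unchanged from the question.
    static final Pattern ID = Pattern.compile(
        "([1]{0,1}[A-Z0-9]{2}[\\s ||-]{0,2}[A-Z0-9]{7}[\\s ||-]{0,2}[A-Z]{3})");

    // Hypothetical helper: true only if the pattern covers the whole span,
    // which is the assumed behaviour of the REGEXP condition.
    static boolean coversWholeSpan(String span) {
        return ID.matcher(span).matches();
    }

    public static void main(String[] args) {
        // Span when nothing follows the identifier (example 2).
        System.out.println(coversWholeSpan("AN A348645 PLB"));
        // Span after the greedy * has also consumed the trailing words (example 1).
        System.out.println(coversWholeSpan("AN A348645 PLB Nest Holdings Pty Ltd"));
    }
}
```

Under that assumption, the first span satisfies the pattern and the second does not, because the trailing words are part of the text being checked.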
Try applying the regular expression directly as a rule in Ruta, just as it is:
"([1]{0,1}[A-Z0-9]{2}[\\s ||-]{0,2}[A-Z0-9]{7}[\\s ||-]{0,2}[A-Z]{3})"->EntityType;
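To check the pattern against the sample inputs before putting it into a script, a search over the raw text can be sketched in plain Java. This assumes that a regular expression rule in Ruta behaves roughly like a search over the document text rather than a match on a fixed token span. RegexScanDemo and findAll are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexScanDemo {
    // Pattern copied unchanged from the rule above.
    static final Pattern ID = Pattern.compile(
        "([1]{0,1}[A-Z0-9]{2}[\\s ||-]{0,2}[A-Z0-9]{7}[\\s ||-]{0,2}[A-Z]{3})");

    // Hypothetical helper: collect every substring of the text that the pattern finds.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = ID.matcher(text);
        while (m.find()) {
            hits.add(m.group(1));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Both inputs from the question, with and without trailing words.
        System.out.println(findAll("Claims Experience Report - AN A348645 PLB Nest Holdings Pty Ltd"));
        System.out.println(findAll("Claims Experience Report - AN A348645 PLB"));
    }
}
```

Because the search is not tied to a token span chosen by a quantifier, the words after PLB do not affect whether the identifier is found.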