TokensRegex: Tokens are null after retokenization


I'm experimenting with Stanford NLP's TokensRegex, trying to find dimensions (e.g. 100x120) in a text. My plan is to first retokenize the input to split such tokens further (using the example provided in retokenize.rules.txt) and then to search for the new pattern.

After the retokenization, however, the tokens show up as null values instead of the original text:

The top level annotation
[Text=100x120 Tokens=[null-1, null-2, null-3] Sentences=[100x120]]

The retokenization itself seems to work (the result has 3 tokens), but the values are lost. How can I keep the original values in the tokens list?

My retokenize.rules.txt file is (as in the demo):

tokens = { type: "CLASS", value:"edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }
options.matchedExpressionsAnnotationKey = tokens;
options.extractWithTokens = TRUE;
options.flatten = TRUE;
ENV.defaults["ruleType"] = "tokens"
ENV.defaultStringPatternFlags = 2
ENV.defaultResultAnnotationKey = tokens

{ pattern: ( /\d+(x|X)\d+/ ), result: Split($0[0], /x|X/, TRUE) }
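
To make the intent of this rule concrete, here is a minimal plain-Java sketch of the split it is meant to perform. It does not use CoreNLP at all: the class DimensionSplitSketch and the method splitKeepingDelimiter are hypothetical names chosen for illustration, and the sketch assumes that the TRUE argument of Split keeps the matched delimiter as a token of its own, which is consistent with the 3 tokens reported above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CoreNLP: mimics the assumed behaviour of
// Split($0[0], /x|X/, TRUE) on a single token's text.
class DimensionSplitSketch {

    // Splits text on delimiterRegex and keeps every matched delimiter as its own piece.
    static List<String> splitKeepingDelimiter(String text, String delimiterRegex) {
        List<String> pieces = new ArrayList<>();
        Matcher matcher = Pattern.compile(delimiterRegex).matcher(text);
        int last = 0;
        while (matcher.find()) {
            if (matcher.start() > last) {
                // Text between the previous delimiter (or the start) and this one.
                pieces.add(text.substring(last, matcher.start()));
            }
            // The delimiter itself becomes a separate piece.
            pieces.add(matcher.group());
            last = matcher.end();
        }
        if (last < text.length()) {
            // Remaining text after the final delimiter.
            pieces.add(text.substring(last));
        }
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(splitKeepingDelimiter("100x120", "x|X")); // prints [100, x, 120]
    }
}
```

Under that assumption, the three tokens after retokenization would be expected to carry the texts 100, x and 120.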

The main method:

public static void main(String[] args) throws IOException {
    //...
    text = "100x120";
    Properties properties = new Properties();
    properties.setProperty("tokenize.language", "de");
    properties.setProperty("annotators", "tokenize,retokenize,ssplit,pos,lemma,ner");
    properties.setProperty("customAnnotatorClass.retokenize", "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
    properties.setProperty("retokenize.rules", "retokenize.rules.txt");
    StanfordCoreNLP pipelineWithRetokenize = new StanfordCoreNLP(properties);
    runPipeline(pipelineWithRetokenize, text);
}

And the pipeline:

public static void runPipeline(StanfordCoreNLP pipeline, String text) {
    Annotation annotation = new Annotation(text);
    pipeline.annotate(annotation);
    out.println();
    out.println("The top level annotation");
    out.println(annotation.toShorterString());
    //...
}

Solution

  • Thanks for letting us know. The CoreAnnotations.ValueAnnotation is not being populated, and we'll update TokensRegex to populate the field.

    Regardless, you should be able to use TokensRegex to retokenize as you planned. Most of the pipeline does not depend on the ValueAnnotation and uses the CoreAnnotations.TextAnnotation instead. You can use the CoreAnnotations.TextAnnotation to get the text of the new tokens (each token is a CoreLabel, so you can also access it using token.word()).

    See TokensRegexRetokenizeDemo for example code on how to get the different annotations out.