stanford-nlp tokenize penn-treebank

Entities containing an underscore character are split into multiple entities by TokensAnnotation in CoreNLP


I am observing that CoreNLP 3.9.2 has started splitting tokens such as enti_ties into multiple ones ('enti', '_', 'ties') while tokenizing.

I have tried the tokenize.whitespace option, which solves this problem. But I think it will also stop the tokenizer from splitting tokens such as "can't" and "don't".


Solution

  • One thing you can do is replace the underscores (_) with periods (.) before processing; the parser (and, I believe, the tokenizer) will then interpret the string as one entity.

    E.g. enti_ties becomes enti.ties, and the latter is retained as one entity.

    This doesn't entirely resolve the problem, but it serves as a workaround in a pinch.
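
The replacement described above could be sketched roughly as follows. This is only an illustration, not CoreNLP code: the class name UnderscoreWorkaround and both helper methods are hypothetical, and it assumes you only want to mask underscores that sit between two letters or digits. Because each underscore is swapped for exactly one character, character offsets stay the same, so the original spelling can be read back from the unmasked text using a token's begin and end offsets (in CoreNLP these should be available from CoreLabel.beginPosition() and CoreLabel.endPosition()).

```java
import java.util.regex.Pattern;

public class UnderscoreWorkaround {
    // An underscore with a letter or digit on both sides, e.g. the one in "enti_ties".
    private static final Pattern INNER_UNDERSCORE =
            Pattern.compile("(?<=[A-Za-z0-9])_(?=[A-Za-z0-9])");

    /** Replaces each inner underscore with a period; the text length is unchanged. */
    public static String maskUnderscores(String text) {
        return INNER_UNDERSCORE.matcher(text).replaceAll(".");
    }

    /** Reads a token's original spelling back from the unmasked text by character offsets. */
    public static String restoreToken(String originalText, int begin, int end) {
        return originalText.substring(begin, end);
    }

    public static void main(String[] args) {
        String original = "These enti_ties should stay whole.";

        // Pass the masked text to the tokenizer instead of the original.
        String masked = maskUnderscores(original);
        System.out.println(masked);

        // Assumption: the tokenizer reported one token covering offsets 6..15 of the masked text.
        System.out.println(restoreToken(original, 6, 15));
    }
}
```

Note that any period that was already inside a token is left alone, which is why the sketch restores by offset instead of turning periods back into underscores.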