I'm using the Stanford CoreNLP tool to tokenize text, and the character offset of each token is very important (I need the offsets later in Brat). The relevant part of my program is as follows:
pipeline.annotate(annotation);
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
if (sentences != null && !sentences.isEmpty()) {
    for (CoreMap sentence : sentences) {
        // CoreMap sentence = sentences.get(0);
        for (CoreMap token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
            // out.println(token+"\t"+token.get(NamedEntityTagAnnotation.class));
            words = token + "\t" + token.get(NamedEntityTagAnnotation.class);
            // toShorterString() already returns a String
            String word_offset = token.toShorterString();
            wordsId.add(words);
            wordsId1.add(words.substring(0, words.indexOf("-")).trim());
            wordsId2.add(word_offset);
            System.out.println(word_offset); // intended for Text_woffset.txt
        }
    }
}
Input = "D: Great!
CM: How are you, Daniella? {BR}
{NS}
D: I'm doing good, except for the fact that I'm hearing a little bit of echo.
CM: Oh. {LG} Darn.
D: Give me a second.
CM: Okay."
I use the following code to read the input:
Text = new Scanner(new File(Input)).useDelimiter("\\A").next();
With this input I get wrong offsets. For example, the offset of the token "Daniella" should be [28, 36], but the tool reports [27, 35], and in the middle of the text the offsets are off by 10 to 30. Could you please tell me how to handle this kind of conversational text with the tokenizer? I also passed the actual text directly as input (to rule out Scanner as the cause), but the problem remains the same.
What you want is the CharacterOffsetBegin and CharacterOffsetEnd annotations attached to each token. A shorthand for these is CoreLabel.beginPosition() and CoreLabel.endPosition(). A minor tweak to your code: the tokens can be typed as CoreLabel (a subclass of CoreMap); the CoreLabel class has a bunch of utility methods that make working with tokens much easier.
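As a quick sanity check that needs no CoreNLP at all, the sketch below computes a token's span directly on the string. It assumes the offsets are 0-based character indices into the exact string handed to the pipeline, with the begin index inclusive and the end index exclusive, so that text.substring(begin, end) gives back the token. The class name OffsetCheck and the helper spanOf are made up for illustration.

```java
class OffsetCheck {
    /**
     * Returns the [begin, end) character span of the first occurrence of
     * token in text, or null if the token does not occur.
     */
    static int[] spanOf(String text, String token) {
        int begin = text.indexOf(token);
        if (begin < 0) {
            return null;
        }
        return new int[] { begin, begin + token.length() };
    }

    public static void main(String[] args) {
        // First two lines of the sample input, with Unix (LF) line endings.
        String lf = "D: Great!\nCM: How are you, Daniella? {BR}\n";
        // The same text with Windows (CRLF) line endings.
        String crlf = lf.replace("\n", "\r\n");

        int[] a = spanOf(lf, "Daniella");
        int[] b = spanOf(crlf, "Daniella");

        System.out.println(a[0] + " " + a[1]);        // 27 35
        System.out.println(b[0] + " " + b[1]);        // 28 36
        System.out.println(lf.substring(a[0], a[1])); // Daniella
    }
}
```

With LF line endings "Daniella" sits at [27, 35); with CRLF every preceding line break adds one character, which gives [28, 36) here and a larger shift further into the text. That matches the numbers in the question, but whether line endings are the actual cause there is only a guess; comparing the string you annotate with the file Brat reads would confirm or rule it out.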
As a general rule, while in the class hierarchy both CoreLabel and Annotation are subclasses of CoreMap, semantically an Annotation is almost always a document, a CoreMap is almost always a sentence, and a CoreLabel is almost always a token.
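A minimal sketch of that layering, reading the offsets straight off each CoreLabel, might look like the following. It assumes CoreNLP is on the classpath and that a pipeline with only the tokenize and ssplit annotators is enough for your purposes; the class name TokenOffsets and the sample string are just for illustration.

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class TokenOffsets {
    public static void main(String[] args) {
        // Assumption: tokenization and sentence splitting are all that is needed here.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        String text = "D: Great!\nCM: How are you, Daniella? {BR}\n";

        // Annotation = document
        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        // CoreMap = sentence
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            // CoreLabel = token
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                int begin = token.beginPosition(); // CharacterOffsetBeginAnnotation
                int end = token.endPosition();     // CharacterOffsetEndAnnotation
                // Slicing the original text with the offsets shows what each span covers.
                System.out.println(begin + "\t" + end + "\t" + text.substring(begin, end));
            }
        }
    }
}
```

Printing text.substring(begin, end) next to the offsets makes it easy to see whether the spans line up with the string you passed in before exporting them to Brat.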