According to the documentation, I can use options such as ssplit.isOneSentence for parsing my document into sentences. How exactly do I do this though, given a StanfordCoreNLP object?
Here's my code -
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, depparse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(doc);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
At what point do I add this option and where? Something like this?
pipeline.ssplit.boundaryTokenRegex = '"'
Specifically, I'd like to know how to set the boundaryTokenRegex option.
EDIT:
I think this seems more appropriate -
props.setProperty("ssplit.boundaryTokenRegex", "\"");
But I still have to verify.
To make a sentence end at every double-quote token ("), set ssplit.boundaryMultiTokenRegex on the Properties object before constructing the pipeline -
props.setProperty("ssplit.boundaryMultiTokenRegex", "/\'\'/");
or
props.setProperty("ssplit.boundaryMultiTokenRegex", "/\"/");
Which one applies depends on how the tokenizer stores the quote token. (CoreNLP normalizes it to the former, i.e. two single quotes.)
And if you want both opening and closing quotes to end a sentence -
props.setProperty("ssplit.boundaryMultiTokenRegex", "/''|``/");
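Putting it together, here is a minimal sketch of where the option goes. The class name SentenceSplitConfig and the helper buildProps are made up for illustration. The CoreNLP calls are left as comments so the snippet compiles without the library on the classpath, and they assume the usual pattern of passing the Properties to the StanfordCoreNLP constructor, as in the question.

```java
import java.util.Properties;

public class SentenceSplitConfig {

    // Hypothetical helper: collects every pipeline option in one Properties object.
    static Properties buildProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, depparse");
        // TokensRegex-style pattern: one token whose text is '' or ``
        // (assumes the tokenizer normalizes double quotes to these forms).
        props.setProperty("ssplit.boundaryMultiTokenRegex", "/''|``/");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps();

        // With CoreNLP available, the options take effect when the pipeline is built:
        // StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // Annotation document = new Annotation(doc);
        // pipeline.annotate(document);
        // List<CoreMap> sentences = document.get(SentencesAnnotation.class);

        System.out.println(props.getProperty("ssplit.boundaryMultiTokenRegex"));
    }
}
```

The point is the ordering: ssplit.* options are ordinary properties, so they have to be set before the pipeline is constructed from them, and changing the Properties object afterwards would not be expected to affect an already built pipeline.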