Using ssplit options for CoreNLP


According to the documentation, I can use options such as ssplit.isOneSentence for parsing my document into sentences. How exactly do I do this though, given a StanfordCoreNLP object?

Here's my code -

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, depparse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(doc);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

At what point do I add this option and where? Something like this?

pipeline.ssplit.boundaryTokenRegex = '"' 

I'd also like to know how to use it for the specific option boundaryTokenRegex

EDIT:

I think this seems more appropriate -

props.put("ssplit.boundaryTokenRegex", "\"");

But I still have to verify.


Solution

  • To make sentences end at every occurrence of a double quote ("), set ssplit.boundaryMultiTokenRegex before constructing the pipeline -

    props.setProperty("ssplit.boundaryMultiTokenRegex", "/\'\'/");
    

    or

    props.setProperty("ssplit.boundaryMultiTokenRegex", "/\"/");
    

    depending on how the quote token is stored. (CoreNLP normalizes it to the former.)

    And if you want both starting and ending quotes -

    props.setProperty("ssplit.boundaryMultiTokenRegex", "/''|``/");
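
    The escapes in the snippets above are easy to get wrong, so here is a minimal sketch that builds the properties with plain java.util.Properties and prints the pattern strings the pipeline would receive. The class name SsplitProps and the helper build are made up for illustration. The alternation pattern assumes the tokenizer emits `` and '' for opening and closing quotes, as described above.

```java
import java.util.Properties;

// Sketch only: SsplitProps and build() are illustrative names, not CoreNLP API.
class SsplitProps {

    // Returns pipeline properties that use the given pattern as a sentence boundary.
    static Properties build(String boundaryPattern) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit");
        props.setProperty("ssplit.boundaryMultiTokenRegex", boundaryPattern);
        return props;
    }

    public static void main(String[] args) {
        // In a Java string literal, \' is just an escaped ', so this is /''/.
        Properties closing = build("/\'\'/");
        // One token pattern that matches either '' or `` (assumed normalized quotes).
        Properties both = build("/''|``/");

        System.out.println(closing.getProperty("ssplit.boundaryMultiTokenRegex")); // prints /''/
        System.out.println(both.getProperty("ssplit.boundaryMultiTokenRegex"));    // prints /''|``/
    }
}
```

    With CoreNLP on the classpath, either Properties object can be passed to new StanfordCoreNLP(props) before calling annotate(), as in the question's code.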