Tags: nlp, stanford-nlp

How can I stop Stanford CoreNLP from segmenting my sentence


My text is already segmented, and I have resources that match those segmented sentences.

How can I stop Stanford CoreNLP from segmenting my sentence before generating the parse tree?

I am working on Chinese.


Solution

  • Your description is not very precise, so I'm not sure whether I'm interpreting your question correctly. It sounds like you want to feed the parser a list of tokens without having CoreNLP do any tokenisation, right? If so, it would be useful to know which parser you are using, but with either one you can simply feed it a list of tokens and CoreNLP will not jump in and mess up your tokenisation. I haven't worked with the Chinese resources, but the following could help you (provided you have already tokenised your text and splitting on whitespace yields the proper tokens):

        import java.util.ArrayList;
        import java.util.List;

        import edu.stanford.nlp.ling.HasWord;
        import edu.stanford.nlp.ling.Word;
        import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
        import edu.stanford.nlp.trees.Tree;

        // Build the token list yourself; the parser uses it as-is and
        // will not re-tokenise the sentence.
        String sentence = "I can't do that .";
        List<HasWord> hwl = new ArrayList<HasWord>();
        for (String t : sentence.split(" ")) {
            hwl.add(new Word(t));
        }

        // Constituency parsing with the lexicalised parser
        LexicalizedParser lexParser = LexicalizedParser.loadModel("<path to chinese lex parsing here>", "-maxLength", "70");
        Tree cTree = lexParser.parse(hwl);
        System.out.println("c tree:" + cTree);
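
    As a side note, newer CoreNLP releases also ship a small helper that builds this token list for you. A minimal sketch, assuming your version has edu.stanford.nlp.ling.SentenceUtils (in older releases the same method lived on edu.stanford.nlp.ling.Sentence):

        import edu.stanford.nlp.ling.SentenceUtils;

        // Equivalent to the manual loop above: wraps each whitespace-separated
        // token in a HasWord without any re-tokenisation.
        List<HasWord> hwl2 = SentenceUtils.toWordList(sentence.split(" "));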
    
    
        import edu.stanford.nlp.ling.TaggedWord;
        import edu.stanford.nlp.parser.nndep.DependencyParser;
        import edu.stanford.nlp.tagger.maxent.MaxentTagger;
        import edu.stanford.nlp.trees.GrammaticalStructure;

        // Dependency parsing: POS-tag the same token list, then parse the tagged words
        DependencyParser parser = DependencyParser.loadFromModelFile("<chinese model for dep parsing here>");
        MaxentTagger tagger = new MaxentTagger("<path to your tagger file goes here>");
        List<TaggedWord> tagged = tagger.tagSentence(hwl);
        GrammaticalStructure gs = parser.predict(tagged);
        System.out.println("dep tree:" + gs.typedDependencies());
    

    After removing the stderr lines that get printed, this results in:

    c tree:(ROOT (S (MPN (FM I) (FM can't)) (VVFIN do) (ADJD that) ($. .)))
    dep tree:[nsubj(can't-2, I-1), root(ROOT-0, can't-2), xcomp(can't-2, do-3), dobj(do-3, that-4), punct(can't-2, .-5)]
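
    If you are using the full StanfordCoreNLP annotation pipeline rather than calling the parsers directly, the pipeline itself can be told to leave your segmentation alone via the documented tokenize.whitespace and ssplit.eolonly properties. A minimal sketch (the annotator list here is an assumption; for Chinese you would set these properties on top of the Chinese properties file that ships with the models):

        import java.util.Properties;

        import edu.stanford.nlp.pipeline.Annotation;
        import edu.stanford.nlp.pipeline.StanfordCoreNLP;

        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, parse");
        // Split tokens on whitespace only; the input is already segmented
        props.setProperty("tokenize.whitespace", "true");
        // Split sentences at newlines only, i.e. one sentence per line
        props.setProperty("ssplit.eolonly", "true");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("I can't do that .");
        pipeline.annotate(doc);

    With these two properties set, whatever tokenisation you hand in on a single line reaches the parser unchanged.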
    

    Hope this helps.