How can I effectively build a sentiment model training dataset using Stanford CoreNLP?


I’m interested in training a new sentiment model on my own dataset. I know that I need to create a file in which every sentence, and each of its component phrases and words, carries a sentiment label.

I figured out how to create a tree like the following for the sentence “I do not love you.” using BuildBinarizedDataset:

(1 (1 I) (1 (1 (1 (1 do) (1 not)) (1 (1 love) (1 you))) (1 .)))

However, adding labels manually in this format seems terribly difficult, particularly for phrases within a longer sentence. It would be far easier if I could generate the following layout for labeling, then convert it back to the tree format when I am ready to train the new model.

sentiment_score pline1

sentiment_score  phrase1

sentiment_score  phrase2

...........................

sentiment_score  phraseN

BLANK ROW

sentiment_score pline2

The problem is that I can’t figure out how to generate this from a sentence with the parser. If someone could provide guidance, or direct me to documentation that will explain this process, it would help me tremendously.


Solution

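  • Here is some sample code I wrote that walks a tree recursively and prints every subtree, one per line, as the node label followed by the words that node covers. To get the printout you want, call the printSubTrees method on your sentiment tree. Note that the main method below only demonstrates the traversal on a tree from the parse annotator, so the labels it prints are syntactic categories (ROOT, S, NP, ...) rather than sentiment scores.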

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.Word;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.trees.Tree;
    import edu.stanford.nlp.trees.TreeCoreAnnotations;
    
    import java.util.ArrayList;
    import java.util.Properties;
    
    public class SubTreesExample {
    
        // Prints one row per node, parent first: the node label, a tab, then the words the node spans.
        public static void printSubTrees(Tree inputTree) {
            // yieldWords() already collects every word under this node, left to right.
            ArrayList<Word> words = inputTree.yieldWords();
            System.out.print(inputTree.label() + "\t");
            for (Word w : words) {
                System.out.print(w.word() + " ");
            }
            System.out.println();
            // Recurse into the children so that every phrase and word gets its own row.
            for (Tree subTree : inputTree.children()) {
                printSubTrees(subTree);
            }
        }
    
        public static void main(String[] args) {
            // The parser only needs tokenization and sentence splitting before it.
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize,ssplit,parse");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            String text = "I do not love you.";
            Annotation annotation = new Annotation(text);
            pipeline.annotate(annotation);
            // Take the constituency parse tree of the first sentence.
            Tree sentenceTree = annotation.get(CoreAnnotations.SentencesAnnotation.class).get(0).get(
                    TreeCoreAnnotations.TreeAnnotation.class);
            printSubTrees(sentenceTree);
        }
    }
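
If you already have the single-line bracketed trees from BuildBinarizedDataset, the round trip the question asks for can also be sketched without touching the CoreNLP API at all. The following is only a minimal illustration, not part of CoreNLP: the class SentimentRows and its methods flatten and relabel are hypothetical names, and it assumes each tree sits on one line and that no token contains a literal parenthesis (Penn Treebank style output normally escapes them as -LRB- and -RRB-). flatten produces one "label TAB phrase" row per node, parent first; relabel writes the edited labels back into the original tree in the same order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a CoreNLP class.
class SentimentRows {

    // Splits "(1 (1 I) (1 .))" into the tokens "(", "1", "(", "1", "I", ")", ...
    static List<String> tokenize(String tree) {
        List<String> tokens = new ArrayList<>();
        for (String t : tree.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+")) {
            tokens.add(t);
        }
        return tokens;
    }

    // Returns one "label<TAB>phrase" row per node, in pre-order (parent before children).
    static List<String> flatten(String tree) {
        List<String> rows = new ArrayList<>();
        parse(tokenize(tree), new int[] {0}, rows);
        return rows;
    }

    // Parses the node starting at pos[0], records its row, and returns the phrase it spans.
    private static String parse(List<String> tokens, int[] pos, List<String> rows) {
        pos[0]++;                                   // consume "("
        String label = tokens.get(pos[0]++);        // the sentiment label
        int slot = rows.size();
        rows.add(null);                             // reserve this node's pre-order position
        StringBuilder phrase = new StringBuilder();
        while (!tokens.get(pos[0]).equals(")")) {
            String part = tokens.get(pos[0]).equals("(")
                    ? parse(tokens, pos, rows)      // child subtree
                    : tokens.get(pos[0]++);         // leaf word
            if (phrase.length() > 0) {
                phrase.append(' ');
            }
            phrase.append(part);
        }
        pos[0]++;                                   // consume ")"
        rows.set(slot, label + "\t" + phrase);
        return phrase.toString();
    }

    // Rebuilds the tree, replacing the label of each node with labels.get(i) in pre-order.
    static String relabel(String tree, List<String> labels) {
        List<String> tokens = tokenize(tree);
        StringBuilder out = new StringBuilder();
        int next = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals("(")) {
                out.append(out.length() > 0 ? " (" : "(").append(labels.get(next++));
                i++;                                // skip the old label token
            } else if (t.equals(")")) {
                out.append(')');
            } else {
                out.append(' ').append(t);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String tree = "(1 (1 I) (1 (1 (1 (1 do) (1 not)) (1 (1 love) (1 you))) (1 .)))";
        for (String row : flatten(tree)) {
            System.out.println(row);
        }
    }
}
```

Because the rows come out in the same order in which the opening parentheses appear in the tree, you can edit only the first column of the printed rows, read those labels back in order, and pass them to relabel together with the original tree line to get a training line in the bracketed format again.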