Search code examples
javanlpstanford-nlp

finding features from a large data set by stanford corenlp


I am new Stanford NLP. I can not find any good and complete documentation or tutorial. My work is to do sentiment analysis. I have a very large dataset of product reviews. I already distinguished them by positive and negative according to "starts" given by the users. Now I need to find the most occurred positive and negative adjectives as the features of my algorithm. I understand how to do tokenzation, lemmatization and POS taging from here. I got files like this.

The review was

Don't waste your money. This is a short DVD and the host is boring and offers information that is common sense to any idiot. Pass on this and buy something else. Very generic

and the output was.

Sentence #1 (6 tokens):
Don't waste your money.
[Text=Do CharacterOffsetBegin=0 CharacterOffsetEnd=2 PartOfSpeech=VBP Lemma=do]
[Text=n't CharacterOffsetBegin=2 CharacterOffsetEnd=5 PartOfSpeech=RB Lemma=not]
[Text=waste CharacterOffsetBegin=6 CharacterOffsetEnd=11 PartOfSpeech=VB Lemma=waste]
[Text=your CharacterOffsetBegin=12 CharacterOffsetEnd=16 PartOfSpeech=PRP$ Lemma=you]
[Text=money CharacterOffsetBegin=17 CharacterOffsetEnd=22 PartOfSpeech=NN Lemma=money]
[Text=. CharacterOffsetBegin=22 CharacterOffsetEnd=23 PartOfSpeech=. Lemma=.]
Sentence #2 (21 tokens):
This is a short DVD and the host is boring and offers information that is common sense to any idiot.
[Text=This CharacterOffsetBegin=24 CharacterOffsetEnd=28 PartOfSpeech=DT Lemma=this]
[Text=is CharacterOffsetBegin=29 CharacterOffsetEnd=31 PartOfSpeech=VBZ Lemma=be]
[Text=a CharacterOffsetBegin=32 CharacterOffsetEnd=33 PartOfSpeech=DT Lemma=a]
[Text=short CharacterOffsetBegin=34 CharacterOffsetEnd=39 PartOfSpeech=JJ Lemma=short]
[Text=DVD CharacterOffsetBegin=40 CharacterOffsetEnd=43 PartOfSpeech=NN Lemma=dvd]
[Text=and CharacterOffsetBegin=44 CharacterOffsetEnd=47 PartOfSpeech=CC Lemma=and]
[Text=the CharacterOffsetBegin=48 CharacterOffsetEnd=51 PartOfSpeech=DT Lemma=the]
[Text=host CharacterOffsetBegin=52 CharacterOffsetEnd=56 PartOfSpeech=NN Lemma=host]
[Text=is CharacterOffsetBegin=57 CharacterOffsetEnd=59 PartOfSpeech=VBZ Lemma=be]
[Text=boring CharacterOffsetBegin=60 CharacterOffsetEnd=66 PartOfSpeech=JJ Lemma=boring]
[Text=and CharacterOffsetBegin=67 CharacterOffsetEnd=70 PartOfSpeech=CC Lemma=and]
[Text=offers CharacterOffsetBegin=71 CharacterOffsetEnd=77 PartOfSpeech=VBZ Lemma=offer]
[Text=information CharacterOffsetBegin=78 CharacterOffsetEnd=89 PartOfSpeech=NN Lemma=information]
[Text=that CharacterOffsetBegin=90 CharacterOffsetEnd=94 PartOfSpeech=WDT Lemma=that]
[Text=is CharacterOffsetBegin=95 CharacterOffsetEnd=97 PartOfSpeech=VBZ Lemma=be]
[Text=common CharacterOffsetBegin=98 CharacterOffsetEnd=104 PartOfSpeech=JJ Lemma=common]
[Text=sense CharacterOffsetBegin=105 CharacterOffsetEnd=110 PartOfSpeech=NN Lemma=sense]
[Text=to CharacterOffsetBegin=111 CharacterOffsetEnd=113 PartOfSpeech=TO Lemma=to]
[Text=any CharacterOffsetBegin=114 CharacterOffsetEnd=117 PartOfSpeech=DT Lemma=any]
[Text=idiot CharacterOffsetBegin=118 CharacterOffsetEnd=123 PartOfSpeech=NN Lemma=idiot]
[Text=. CharacterOffsetBegin=123 CharacterOffsetEnd=124 PartOfSpeech=. Lemma=.]
Sentence #3 (8 tokens):
Pass on this and buy something else.
[Text=Pass CharacterOffsetBegin=125 CharacterOffsetEnd=129 PartOfSpeech=VB Lemma=pass]
[Text=on CharacterOffsetBegin=130 CharacterOffsetEnd=132 PartOfSpeech=IN Lemma=on]
[Text=this CharacterOffsetBegin=133 CharacterOffsetEnd=137 PartOfSpeech=DT Lemma=this]
[Text=and CharacterOffsetBegin=138 CharacterOffsetEnd=141 PartOfSpeech=CC Lemma=and]
[Text=buy CharacterOffsetBegin=142 CharacterOffsetEnd=145 PartOfSpeech=VB Lemma=buy]
[Text=something CharacterOffsetBegin=146 CharacterOffsetEnd=155 PartOfSpeech=NN Lemma=something]
[Text=else CharacterOffsetBegin=156 CharacterOffsetEnd=160 PartOfSpeech=RB Lemma=else]
[Text=. CharacterOffsetBegin=160 CharacterOffsetEnd=161 PartOfSpeech=. Lemma=.]
Sentence #4 (2 tokens):
Very generic
[Text=Very CharacterOffsetBegin=162 CharacterOffsetEnd=166 PartOfSpeech=RB Lemma=very]
[Text=generic CharacterOffsetBegin=167 CharacterOffsetEnd=174 PartOfSpeech=JJ Lemma=generic]

I already have processed 10000 positive and 10000 negative file like this. Now How can I easily find the most occurred positive and negative features(adjectives)? Do i need to read all the output(processed) file and make a list frequency count of the adjectives like this or is there any easy way by stanford corenlp?


Solution

  • Here is an example of processing an annotated review and storing the adjectives in a Counter.

    In the example the movie review "The movie was great! It was a great film." has a sentiment of "positive".

    I would suggest altering my code to load in each file and build an Annotation with the file's text and recording the sentiment for that file.

    Then you can go through each file and build up a Counter with positive and negative counts for each adjective.

    The final Counter has the adjective "great" with a count of 2.

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.stats.Counter;
    import edu.stanford.nlp.stats.ClassicCounter;
    import edu.stanford.nlp.util.CoreMap;
    
    import java.util.Properties;
    
    public class AdjectiveSentimentExample {
    
        public static void main(String[] args) throws Exception {
    
            Counter<String> adjectivePositiveCounts = new ClassicCounter<String>();
            Counter<String> adjectiveNegativeCounts = new ClassicCounter<String>();
    
            Annotation review = new Annotation("The movie was great!  It was a great film.");
            String sentiment = "positive";
    
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            pipeline.annotate(review);
            for (CoreMap sentence : review.get(CoreAnnotations.SentencesAnnotation.class)) {
                for (CoreLabel cl : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    if (cl.get(CoreAnnotations.PartOfSpeechAnnotation.class).equals("JJ")) {
                        if (sentiment.equals("positive")) {
                            adjectivePositiveCounts.incrementCount(cl.word());
                        } else if (sentiment.equals("negative")) {
                            adjectiveNegativeCounts.incrementCount(cl.word());
                        }
                    }
    
                }
            }
    
            System.out.println("---");
            System.out.println("positive adjective counts");
            System.out.println(adjectivePositiveCounts);
        }
    }