Tags: java, machine-learning, stanford-nlp

Can someone explain how to create a PTB dataset and/or train my own model using StanfordNLP?


I'm learning about sentiment analysis and I can't find anything online that outlines how to create a PTB dataset. I'm using StanfordNLP with Java. I've downloaded the train, dev and test data that they used, but I can't get my head around how these files are structured:

test.txt:

(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))

I figure that the numbers correspond to sentiment values, but I'm still not sure how it works.

TL;DR: I'm trying to develop my own model for news analysis. The StanfordNLP model was trained on movie reviews, which leads to poor sentiment analysis on my data, so I'd like to train my own, but I can't find anything online that explains what each element is or how to do this.

The closest I've found is this page, which provides the dataset and the code to train: https://nlp.stanford.edu/sentiment/code.html

Models can be retrained using the following command using the PTB format dataset:

java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz

I have the data that I need to parse ready.


Solution

  • Okay, so I've done some more digging and have finally started to understand (somewhat) how to create a dataset tree. I'll try to break it down for anyone who stumbles upon this post with the same trouble I've been having.

    Step 1:

    • Find your data. (In my case it's news articles about the UK housing market)
    UK renters: are you living with someone you’ve fallen out with?
    UK property asking prices stagnating, lifting hopes of softer landing for housing market
    

    Step 2:

    • Annotate your data
    2 UK renters: are you living with someone you’ve fallen out with?
    1 fallen out with
    1 fallen out
    2 UK renters
    2 living with someone
    3 fallen
    2 :
    2 ?
    2 living with
    2 someone
    
    3 UK property asking prices stagnating, lifting hopes of softer landing for housing market
    2 UK property
    3 asking prices stagnating
    2 asking prices
    4 lifting hopes
    2 hopes
    4 lifting hopes of softer landing
    3 softer landing for housing market
    2 housing market
    2 lifting
    2 landing
    2 , 
    

    Annotation Meanings

    Very Positive = 4
    Positive      = 3
    Neutral       = 2
    Negative      = 1
    Very Negative = 0
    

    Structure

    2 UK renters: are you living with someone you’ve fallen out with?
       //Overall sentiment
    
    1 fallen out with
       // Negative
    
    1 fallen out
       // Negative
    
    2 UK renters
       // Neutral
    
    ...etc..
    
    • Save the annotated data to a .txt (sample.txt)
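Before running the next steps, it may help to sanity-check the annotated file. The following is only a rough sketch, not part of CoreNLP: the class name AnnotationCheck and its rules are my own assumptions based on the layout shown above (a label from 0 to 4, whitespace, then text; the first line of each block is the full sentence; blocks are separated by blank lines). The "phrase is contained in the sentence" test is a plain substring check, which is stricter and simpler than whatever token matching the real tool does.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of CoreNLP) that checks the annotation layout.
public class AnnotationCheck {
    // A label 0-4, whitespace, then the sentence or phrase text.
    private static final Pattern LINE = Pattern.compile("^([0-4])\\s+(\\S.*)$");

    // Returns a list of human-readable problems; an empty list means no issues found.
    public static List<String> check(List<String> lines) {
        List<String> problems = new ArrayList<>();
        String sentence = null; // text of the first line in the current block
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) {   // a blank line ends the current block
                sentence = null;
                continue;
            }
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                problems.add("line " + (i + 1) + ": expected '<0-4> <text>'");
                continue;
            }
            String text = m.group(2);
            if (sentence == null) {
                sentence = text;    // first valid line of a block is the sentence
            } else if (!sentence.contains(text)) {
                problems.add("line " + (i + 1) + ": phrase not found in its sentence");
            }
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java AnnotationCheck sample.txt
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        check(lines).forEach(System.out::println);
    }
}
```

If it prints nothing, the file at least follows the layout described above.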

    Step 3:

    • Locate your stanford-corenlp-4.5.2.jar

      • for example: ~/.m2/repository/edu/stanford/nlp/stanford-corenlp/4.5.2

    Step 4:

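    • Open Bash in that directory (so that -cp "*" picks up the jars there) and run
      • java -cp "*" -mx5g edu.stanford.nlp.sentiment.BuildBinarizedDataset -input /c/Users/rusku/Desktop/StanfordNPL/rusSample/sample.txt
      • replace the input path above with the location of your own file; as far as I can tell the trees are printed to standard output, so redirect them to a file (for example by appending > train.txt) if you want to keep them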

    Step 5:

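    • Result (nodes that no annotated phrase covers appear to inherit the sentence-level label, which would explain why words such as "of" and "for" come out as 3 in the second tree)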
    (2 (2 (2 (2 UK) (2 renters)) (2 :)) (2 (2 (2 (2 are) (2 you)) (2 (2 living) (2 (2 with) (2 (2 someone) (2 (2 you) (2 (2 ’ve) (1 (1 (3 fallen) (2 out)) (2 with)))))))) (2 ?)))
    (3 (3 (2 (3 UK) (3 property)) (2 (3 asking) (3 prices))) (3 (3 (3 stagnating) (3 (2 ,) (4 (2 lifting) (2 hopes)))) (3 (3 of) (3 (3 (3 softer) (2 landing)) (3 (3 for) (2 (3 housing) (3 market)))))))
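To see what each number in a tree like these refers to, here is a small sketch that walks one tree and lists every node's label next to the words it covers. It is plain Java and does not use CoreNLP. The class name SentimentTreeReader is made up for illustration, and the parser assumes the simple layout shown above: every node is "(label child child)" or "(label word)", with a numeric label.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for the bracketed sentiment trees (not part of CoreNLP).
public class SentimentTreeReader {
    private final String s;
    private int pos = 0;
    private final List<String> spans = new ArrayList<>();

    private SentimentTreeReader(String tree) {
        this.s = tree;
    }

    // Parses one "(label ...)" node starting at pos and returns the words it covers.
    private String parseNode() {
        pos++; // skip '('
        int start = pos;
        while (Character.isDigit(s.charAt(pos))) pos++;
        String label = s.substring(start, pos);
        int slot = spans.size();
        spans.add(null); // reserve a slot so parents are listed before their children
        StringBuilder text = new StringBuilder();
        while (s.charAt(pos) != ')') {
            if (Character.isWhitespace(s.charAt(pos))) {
                pos++;
                continue;
            }
            String piece;
            if (s.charAt(pos) == '(') {
                piece = parseNode(); // nested phrase
            } else {
                int w = pos;         // a single word (leaf)
                while (s.charAt(pos) != ')' && !Character.isWhitespace(s.charAt(pos))) pos++;
                piece = s.substring(w, pos);
            }
            if (text.length() > 0) text.append(' ');
            text.append(piece);
        }
        pos++; // skip ')'
        spans.set(slot, label + "\t" + text);
        return text.toString();
    }

    // Returns one "label<TAB>phrase" entry per node, root first.
    public static List<String> read(String tree) {
        SentimentTreeReader r = new SentimentTreeReader(tree.trim());
        r.parseNode();
        return r.spans;
    }

    public static void main(String[] args) {
        // Subtree taken from the first result above.
        String tree = "(1 (1 (3 fallen) (2 out)) (2 with))";
        read(tree).forEach(System.out::println);
    }
}
```

Each printed line pairs a label with a phrase, which is the same label-plus-phrase information the annotation file in Step 2 supplies, only spread over every node of the binarized parse tree.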
    

    Resource: Train Stanford CoreNLP about the sentiment of domain-specific phrases

    This is as far as I've currently gotten.

    Hope this helps.