Tags: java, stanford-nlp, sentiment-analysis

I got a different result when I retrained the sentiment model with Stanford CoreNLP and compared it with the related paper's results


I downloaded stanford-corenlp-full-2015-12-09 and trained a model with the following command:

 java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz
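
For reference, the same training run can also be started from Java code instead of the command line. This is only a sketch: it assumes the CoreNLP jars are on the classpath and that train.txt and dev.txt (PTB-style sentiment trees) are in the working directory, and it simply forwards the same flags to SentimentTraining:

    // Minimal sketch: start the same training run programmatically.
    // Assumes the CoreNLP jars are on the classpath and that train.txt / dev.txt
    // (PTB-style sentiment trees) are in the working directory.
    import edu.stanford.nlp.sentiment.SentimentTraining;

    public class TrainSentimentModel {
        public static void main(String[] args) throws Exception {
            SentimentTraining.main(new String[] {
                "-numHid", "25",          // dimensionality of the word/hidden vectors
                "-trainPath", "train.txt",
                "-devPath", "dev.txt",
                "-train",                 // actually run training
                "-model", "model.ser.gz"  // base name used for the saved checkpoints
            });
        }
    }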

When training finished, I found many model files in my directory: [the model list]

Then I used the evaluation tool from the package and ran it like this:

java -cp * edu.stanford.nlp.sentiment.Evaluate -model model-0024-79.82.ser.gz -treebank test.txt

The test.txt file was from trainDevTestTrees_PTB.zip. This is the output:

F:\trainDevTestTrees_PTB\trees>java -cp * edu.stanford.nlp.sentiment.Evaluate -model model-0024-79.82.ser.gz -treebank test.txt
EVALUATION SUMMARY
Tested 82600 labels
65331 correct
17269 incorrect
0.790932 accuracy
Tested 2210 roots
890 correct
1320 incorrect
0.402715 accuracy
Label confusion matrix
  Guess/Gold       0       1       2       3       4    Marg. (Guess)
           0     551     340      87      32       6    1016
           1     956    5348    2476     686     191    9657
           2     354    2812   51386    3097     467   58116
           3     146     744    2525    6804    1885   12104
           4       1      11      74     379    1242    1707
Marg. (Gold)    2008    9255   56548   10998    3791

           0        prec=0.54232, recall=0.2744, spec=0.99423, f1=0.36442
           1        prec=0.5538, recall=0.57785, spec=0.94125, f1=0.56557
           2        prec=0.8842, recall=0.90871, spec=0.74167, f1=0.89629
           3        prec=0.56213, recall=0.61866, spec=0.92598, f1=0.58904
           4        prec=0.72759, recall=0.32762, spec=0.9941, f1=0.4518

Root label confusion matrix
  Guess/Gold       0       1       2       3       4    Marg. (Guess)
           0      50      60      12       9       3     134
           1     161     370     147      94      36     808
           2      31     103     102      60      32     328
           3      36      97     123     305     265     826
           4       1       3       5      42      63     114
Marg. (Gold)     279     633     389     510     399

           0        prec=0.37313, recall=0.17921, spec=0.9565, f1=0.24213
           1        prec=0.45792, recall=0.58452, spec=0.72226, f1=0.51353
           2        prec=0.31098, recall=0.26221, spec=0.87589, f1=0.28452
           3        prec=0.36925, recall=0.59804, spec=0.69353, f1=0.45659
           4        prec=0.55263, recall=0.15789, spec=0.97184, f1=0.24561

Approximate Negative label accuracy: 0.638817
Approximate Positive label accuracy: 0.697140
Combined approximate label accuracy: 0.671925
Approximate Negative root label accuracy: 0.702851
Approximate Positive root label accuracy: 0.742574
Combined approximate root label accuracy: 0.722680
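
As a sanity check on how the summary numbers are computed: each accuracy is just the diagonal of the corresponding confusion matrix divided by the total count, i.e. 65331/82600 ≈ 0.7909 over all tree nodes and 890/2210 ≈ 0.4027 over sentence roots, which roughly correspond to the paper's all-nodes and root-level fine-grained figures. A small sketch that recomputes them from the matrices printed above:

    // Sanity check: the summary accuracies reported by Evaluate are the diagonal
    // of each confusion matrix divided by the total number of examples.
    public class ConfusionMatrixAccuracy {
        static double accuracy(int[][] m) {
            int correct = 0, total = 0;
            for (int guess = 0; guess < m.length; guess++) {
                for (int gold = 0; gold < m[guess].length; gold++) {
                    if (guess == gold) correct += m[guess][gold];
                    total += m[guess][gold];
                }
            }
            return (double) correct / total;
        }

        public static void main(String[] args) {
            // Label confusion matrix from the output above (rows = guess, cols = gold)
            int[][] labels = {
                {551, 340, 87, 32, 6},
                {956, 5348, 2476, 686, 191},
                {354, 2812, 51386, 3097, 467},
                {146, 744, 2525, 6804, 1885},
                {1, 11, 74, 379, 1242}
            };
            // Root confusion matrix from the output above
            int[][] roots = {
                {50, 60, 12, 9, 3},
                {161, 370, 147, 94, 36},
                {31, 103, 102, 60, 32},
                {36, 97, 123, 305, 265},
                {1, 3, 5, 42, 63}
            };
            System.out.printf("label accuracy: %.6f%n", accuracy(labels)); // ~0.790932
            System.out.printf("root accuracy:  %.6f%n", accuracy(roots));  // ~0.402715
        }
    }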

The fine-grained and positive/negative accuracies were quite different from those reported in the paper "Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y. and Potts, C., 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1631-1642." The paper reports higher fine-grained and positive/negative accuracy than I got: [the records in the paper]

Did I do something wrong in these steps? Why is my result different from the paper's?


Solution

  • The short answer is that the paper's results were produced by a different system, written in Matlab; the Java system does not match the paper. However, we do distribute the binary model trained with the Matlab code as part of the English models jar. So you can RUN that binary model with Stanford CoreNLP, but you cannot TRAIN a model with similar performance with Stanford CoreNLP at this time.
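
    For example, a rough sketch of running the distributed model through a normal annotation pipeline. It assumes stanford-corenlp-3.6.0.jar and the English models jar stanford-corenlp-3.6.0-models.jar are on the classpath; with the models jar present, the sentiment annotator loads the shipped model by default, and the input sentence below is just an illustration:

        // Rough sketch: run the sentiment model shipped in the English models jar
        // (the Matlab-trained one mentioned above). Assumes stanford-corenlp-3.6.0.jar
        // and stanford-corenlp-3.6.0-models.jar are on the classpath.
        import java.util.Properties;
        import edu.stanford.nlp.ling.CoreAnnotations;
        import edu.stanford.nlp.pipeline.Annotation;
        import edu.stanford.nlp.pipeline.StanfordCoreNLP;
        import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
        import edu.stanford.nlp.util.CoreMap;

        public class RunShippedSentimentModel {
            public static void main(String[] args) {
                Properties props = new Properties();
                // The sentiment annotator needs a parse, so include the parser.
                props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
                StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

                Annotation doc = new Annotation("This movie was actually neither that funny, nor super witty.");
                pipeline.annotate(doc);
                for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
                    // Prints one of: Very negative, Negative, Neutral, Positive, Very positive
                    System.out.println(sentence.get(SentimentCoreAnnotations.SentimentClass.class));
                }
            }
        }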