
Unexpected Date / DateTime Strings cause exception in Stanford CoreNLP


According to the issue on CoreNLP's Git, this has been fixed in some version of CoreNLP. My guess is 3.5.1, since NER is listed as one of the changed modules in its change notes. However, 3.5.x requires the jump to Java 1.8, and we are not prepared to make that move right now.

Also, a disclaimer: I posted to that issue as well, but my comment may not have been seen because the issue is already resolved. Since SO is an official support forum for CoreNLP, I am asking here.

So my question is: what change fixed this? Does the fix in fact exist in a current version, or does something else need to be done? I need to fix this without upgrading from 3.4.1, which I am currently using.

For the record, the string below is supposed to represent Dec 3, 2009 at 10:00 (the string gives no seconds, so we assume 00).
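To make the intended reading concrete, here is a minimal sketch that parses the string as `yyyyMMddHHmm` with `SimpleDateFormat`, which is available on Java 7. It is only an illustration of the interpretation I expect, not anything CoreNLP does; the class and method names are hypothetical, and the JVM's default time zone is assumed.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

public class CompactTimestampSketch {

    // Hypothetical helper: reads a 12-digit string as yyyyMMddHHmm
    // in the JVM's default time zone. Seconds are not in the pattern,
    // so they default to 0.
    public static Calendar parseCompact(String text) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmm", Locale.ROOT);
        format.setLenient(false); // reject impossible values such as month 13
        try {
            Date date = format.parse(text);
            Calendar calendar = new GregorianCalendar();
            calendar.setTime(date);
            return calendar;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a yyyyMMddHHmm timestamp: " + text, e);
        }
    }

    public static void main(String[] args) {
        Calendar c = parseCompact("200912031000");
        // Calendar.MONTH is zero-based, so add 1 for display.
        System.out.println(c.get(Calendar.YEAR) + "-"
                + (c.get(Calendar.MONTH) + 1) + "-"
                + c.get(Calendar.DAY_OF_MONTH) + " "
                + c.get(Calendar.HOUR_OF_DAY) + ":"
                + c.get(Calendar.MINUTE) + ":"
                + c.get(Calendar.SECOND));
    }
}
```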

Here is the stack trace.

    java.lang.NumberFormatException: For input string: "200912031000"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:583)
        at java.lang.Integer.valueOf(Integer.java:766)
        at edu.stanford.nlp.ie.pascal.ISODateInstance.extractDay(ISODateInstance.java:1107)
        at edu.stanford.nlp.ie.pascal.ISODateInstance.extractFields(ISODateInstance.java:398)
        at edu.stanford.nlp.ie.pascal.ISODateInstance.<init>(ISODateInstance.java:82)
        at edu.stanford.nlp.ie.QuantifiableEntityNormalizer.normalizedDateString(QuantifiableEntityNormalizer.java:363)
        at edu.stanford.nlp.ie.QuantifiableEntityNormalizer.normalizedDateString(QuantifiableEntityNormalizer.java:338)
        at edu.stanford.nlp.ie.QuantifiableEntityNormalizer.processEntity(QuantifiableEntityNormalizer.java:1018)
        at edu.stanford.nlp.ie.QuantifiableEntityNormalizer.addNormalizedQuantitiesToEntities(QuantifiableEntityNormalizer.java:1320)
        at edu.stanford.nlp.ie.NERClassifierCombiner.classifyWithGlobalInformation(NERClassifierCombiner.java:145)
        at edu.stanford.nlp.ie.AbstractSequenceClassifier.classifySentenceWithGlobalInformation(AbstractSequenceClassifier.java:322)
        at edu.stanford.nlp.pipeline.NERCombinerAnnotator.doOneSentence(NERCombinerAnnotator.java:148)
        at edu.stanford.nlp.pipeline.SentenceAnnotator.annotate(SentenceAnnotator.java:95)
        at edu.stanford.nlp.pipeline.NERCombinerAnnotator.annotate(NERCombinerAnnotator.java:137)
        at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:67)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:847)

EDIT

I am looking at this again because I am currently working on the SUTime portions of my code, and I can reproduce the problem simply by doing:

    ISODateInstance idi = new ISODateInstance();
    boolean fields = idi.extractFields("200912031000");
    System.out.println(fields);

Note that the printed value is true.


Solution

  • OK, let me explain why the problem existed. There were two problems with extractDay() in 3.4.1:

    1. Integer.valueOf is used on line 1107. This causes the error we see, because the string, if construed as a number, is too large for an Integer (200912031000 exceeds Integer.MAX_VALUE, which is 2147483647) and would have to be a Long. Later versions use Long.valueOf.
    2. extractDay should return false, because it was unable to do anything with that string. However, the try block (line 1106) is inside the for loop (line 1097), so after a failure more tokens can still be examined, and the method can eventually return true. This allows the annotation to be created even though, technically, none should be, since parsing failed. In later versions the try was moved outside the for loop.

    So the only answer is to update to a later version (although I still can't do that at this time).
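
The two problems can be illustrated with a small standalone sketch. This is not the CoreNLP source: the class, the method names, and the token list are hypothetical, and the sketch assumes the catch clause inside the loop swallows the failure, which is how I read the 3.4.1 control flow.

```java
import java.util.Arrays;
import java.util.List;

public class ExtractDaySketch {

    // Problem 1: does the digit string fit in an Integer?
    public static boolean fitsInInteger(String digits) {
        try {
            Integer.valueOf(digits);
            return true;
        } catch (NumberFormatException e) {
            return false; // value is outside the int range (or not a number)
        }
    }

    // Problem 2, simplified model of the 3.4.1 shape: the try sits INSIDE
    // the loop, so a bad token is skipped and a later token can still
    // make the method return true.
    public static boolean tryInsideLoop(List<String> tokens) {
        for (String token : tokens) {
            try {
                Integer.valueOf(token);
                return true;
            } catch (NumberFormatException e) {
                // failure swallowed; the loop moves on to the next token
            }
        }
        return false;
    }

    // Simplified model of the later shape: the try sits OUTSIDE the loop,
    // so the first bad token ends processing and the method returns false.
    public static boolean tryOutsideLoop(List<String> tokens) {
        try {
            for (String token : tokens) {
                Integer.valueOf(token);
                return true;
            }
        } catch (NumberFormatException e) {
            return false;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("200912031000", "3"); // made-up tokens
        System.out.println(fitsInInteger("200912031000")); // too large for an int
        System.out.println(Long.valueOf("200912031000"));  // fits in a long
        System.out.println(tryInsideLoop(tokens));
        System.out.println(tryOutsideLoop(tokens));
    }
}
```

With the made-up tokens above, the inside-the-loop variant reports success because the second token parses, while the outside-the-loop variant stops at the first failure. That matches the behavior difference described in point 2.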