Tags: regex, date, stanford-nlp, sutime

SUTime SequenceMatchRules for "c. DATE - DATE BC"


I'm fighting with Stanford's SequenceMatchRules for recognizing the following input as two dates:

Anaximander (c. 610 – c. 546 BC) was a pre-Socratic Greek philosopher who lived in Miletus, a city of Ionia (in modern-day Turkey).

(taken from the Pantheon dataset, see http://pantheon.media.mit.edu)

'546 BC' works just fine, but I also want to recognize '610' as '610 BC' (preferably NOT as a duration).

What I did so far just to get things going:

Modified english.sutime.txt:

Changed

$POSSIBLE_YEAR = ( $YEAR /a\.?d\.?|b\.?c\.?/? | $INT /a\.?d\.?|b\.?c\.?/ | $INT1000TO3000 );

to

$POSSIBLE_YEAR = ( $YEAR /a\.?d\.?|b\.?c\.?/? | $INT /a\.?d\.?|b\.?c\.?/ | /c\.\ / $INT | $INT1000TO3000 );

And in the extraction rule whose pattern is ( $POSSIBLE_YEAR)..., I changed

          Tag($0, "YEAR_ERA",
            :case {
               $0 =~ ( $INT /a\.?d\.?/ ) => ERA_AD,
               $0 =~ ( $INT /b\.?c\.?/ ) => ERA_BC,
               :else => ERA_UNKNOWN
            }
          )

to

          Tag($0, "YEAR_ERA",
            :case {
               $0 =~ ( $INT /a\.?d\.?/ ) => ERA_AD,
               $0 =~ ( /c\.\ / $INT ) => ERA_BC,
               $0 =~ ( $INT /b\.?c\.?/ ) => ERA_BC,
               :else => ERA_UNKNOWN
            }
          )

First, it's ugly; second, it didn't work at all.

Where should I begin to get this right?

I'm using stanford-corenlp-full-2018-10-05.

I should mention that Pantheon is not perfectly normalized, so I will also have to deal with things like CE/BCE, missing spaces around dates, etc. later. An extensible approach would therefore be great.


Solution

  • I think this rule would match c. 610 ... When it sees the pattern, it attaches the corresponding IsoDate to it. Please let me know whether that works; if it doesn't, I can figure out what's broken.

    { (/c\./ (/[0-9]{3,4}/)) => IsoDate($1[0].numcompvalue, NIL, NIL, 0, FALSE) }
    

    For reference, here is the IsoDate constructor that takes an era:

    public IsoDate(Number y, Number m, Number d, Number era, Boolean yearEraAdjustNeeded) {
      this.year = (y != null)? y.intValue():-1;
      this.month = (m != null)? m.intValue():-1;
      this.day = (d != null)? d.intValue():-1;
      this.era = (era != null)? era.intValue():ERA_UNKNOWN;
      if (yearEraAdjustNeeded != null && yearEraAdjustNeeded && this.era == ERA_BC) {
        if (this.year > 0) {
          this.year--;
        }
      }
      initBase();
    }
    

    If that rule works, it should demonstrate how to match a text pattern and attach the desired year. It might be easiest to write a pantheon_rules.txt file that covers everything you want and add it to your list of SUTime rules. Once you have that basic rule down, you can extend it to match the other cases. I could also work on adding rules for these cases to the official release at some point.
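
For the step of adding pantheon_rules.txt to the list of SUTime rules, here is a minimal sketch of the configuration. It assumes the sutime.rules property (a comma-separated list of rule files) and the default rule file paths from the CoreNLP models jar. The class name PantheonSutimeConfig and the helper withExtraRules are made up for illustration, and pantheon_rules.txt is assumed to be reachable from the working directory or the classpath. The sketch only builds the Properties object, so it runs without CoreNLP; building the pipeline is shown as a comment.

```java
import java.util.Properties;

class PantheonSutimeConfig {
    // Default SUTime rule files as shipped in the CoreNLP models jar (assumed paths).
    static final String DEFAULT_RULES = String.join(",",
        "edu/stanford/nlp/models/sutime/defs.sutime.txt",
        "edu/stanford/nlp/models/sutime/english.sutime.txt",
        "edu/stanford/nlp/models/sutime/english.holidays.sutime.txt");

    // Hypothetical helper: returns pipeline properties with one extra rule file
    // appended after the defaults.
    static Properties withExtraRules(String extraRulesPath) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        props.setProperty("sutime.rules", DEFAULT_RULES + "," + extraRulesPath);
        return props;
    }

    public static void main(String[] args) {
        Properties props = withExtraRules("pantheon_rules.txt");
        System.out.println(props.getProperty("sutime.rules"));
        // With CoreNLP on the classpath, the pipeline would then be built as:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

Listing the custom file last should let the rules in pantheon_rules.txt reuse the definitions loaded from the default files. Whether a custom rule wins over an overlapping built-in rule may still depend on the rule's stage and priority settings, so that part needs testing against real input.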