nlp, machine-learning, information-extraction, named-entity-recognition, temporal

How to find references to dates in natural text?


I want to parse raw natural-language text and find all the phrases that describe dates.

I've got a fairly big corpus with all the references to dates marked up:

I met him <date>yesterday</date>.
Roger Zelazny was born <date>in 1937</date>
He'll have a hell of a hangover <date>tomorrow morning</date>

I don't want to interpret the date phrases, just locate them. The fact that they're dates is irrelevant (in real life they're not even dates, but I don't want to bore you with the details); basically, it's just an open-ended set of possible values. The grammar of the values themselves can be approximated as context-free. However, it's quite complicated to build manually, and as the grammar grows more complex it gets increasingly hard to avoid false positives.

I know this is a bit of a long shot, so I'm not expecting an out-of-the-box solution to exist, but what technology or research could I potentially use?


Solution

  • One of the generic approaches used in academia and in industry is based on Conditional Random Fields (CRFs). A CRF is a probabilistic sequence-labeling model: you first train it on your marked-up data, and it can then label entities of the trained types in new text.

    You can even try one of the systems from the Stanford Natural Language Processing Group: the Stanford Named Entity Recognizer (Stanford NER).

    When you download the tool, note that it ships with several models; you need the last one in the list below (the 7 class model, which includes Date):

    Included with the Stanford NER are a 4 class model trained for CoNLL, a 7 class model trained for MUC, and a 3 class model trained on both data sets for the intersection of those class sets.

    3 class Location, Person, Organization

    4 class Location, Person, Organization, Misc

    7 class Time, Location, Organization, Person, Money, Percent, Date

    Update: You can also try the tool online. Select the muc.7class.distsim.crf.ser.gz classifier and try some text containing dates. It doesn't seem to recognize "yesterday", but it does recognize "20th century", for example. In the end, what gets recognized is a matter of CRF training.


    [Image: Stanford NER screenshot]
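
To train a CRF on a corpus like the one in the question, the inline <date> markup usually has to be turned into one label per token first. The following is a minimal sketch of that step, not part of the original answer. It assumes the common BIO labeling convention (B-DATE, I-DATE, O) and a simple regex tokenizer; the names to_bio and token_features are hypothetical helpers made up for illustration, and the feature set is only an example of what a CRF toolkit might be given.

```python
import re

# A <date>...</date> span in the marked-up corpus.
TAG_RE = re.compile(r"<date>(.*?)</date>", re.DOTALL)
# Words (keeping contractions such as "He'll" together) or single punctuation marks.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens (assumed tokenizer)."""
    return TOKEN_RE.findall(text)


def to_bio(marked_up):
    """Convert '<date>...</date>' markup into (token, label) pairs.

    The first token of a date phrase gets B-DATE, the following tokens of
    the same phrase get I-DATE, and every other token gets O.
    """
    pairs = []
    pos = 0
    for match in TAG_RE.finditer(marked_up):
        # Text before the span lies outside any date phrase.
        pairs += [(tok, "O") for tok in tokenize(marked_up[pos:match.start()])]
        for i, tok in enumerate(tokenize(match.group(1))):
            pairs.append((tok, "B-DATE" if i == 0 else "I-DATE"))
        pos = match.end()
    # Remaining text after the last span.
    pairs += [(tok, "O") for tok in tokenize(marked_up[pos:])]
    return pairs


def token_features(tokens, i):
    """Illustrative per-token features; a real model would likely use more."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_title": tok.istitle(),
        "suffix3": tok[-3:].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


if __name__ == "__main__":
    for token, label in to_bio("Roger Zelazny was born <date>in 1937</date>"):
        print(f"{token}\t{label}")
```

The token/label pairs, together with features like these, are the kind of input a CRF trainer works from; the exact file format and feature configuration depend on the toolkit you pick.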