python spacy named-entity-recognition dateparser datefinder

What is the most accurate way to detect dates in text?


I'm working on a sensitive data recognition (NER) task, and I've run into the problem that I cannot accurately detect dates in text. I've tried almost everything...

For example I have this type of dates in my text:

date_list = ['23 octbr', '08/10/1975', '2/20/1961', 'December 23', '2021', '1/10/1980', ...]

But I must say that there is also a lot of numerical information in the text, for example, IP addresses, house addresses, bank card numbers, etc.

This is an example of how spaCy handles them:

'08/10/1975' -> Entity type: No Entity
'2/20/1961' -> Entity type: DATE
'1/10/1980' -> Entity type: CARDINAL

Another example: given the phone number "(150) 224-2215", spaCy marks the part "24-2215" as a DATE. The same often happens with addresses and credit card numbers.

I then tried datefinder and dateparser.search, but they matched completely unrelated parts of the sentence, or spans that contained the word "to".

Can you please share your experience, what could work better? What is the best way to get high accuracy of date detection?


Solution

  • What does your corpus include? Does it contain full sentences?

    • First of all, you can try spaCy NER with context. NER models rely on the surrounding words, so they work best on full sentences rather than on isolated strings.

    • If you are looking for a more token/shape-oriented solution, I suggest context-free parsing. A context-free grammar is great for describing dates. Basically, you define some grammar rules such as:

    calendar_year -> full_year | year
    year -> 19\d{2} | 20\d{2}
    full_year -> day/month/year | day.month.year
    day -> digit_num | two_digit_num
    month -> digit_num | two_digit_num
    two_digit_num -> digit_num digit_num
    digit_num -> 0 | 1 | 2 | ... | 9
    

    Regex is not a good idea here, because it has no "context": the characters being matched are not aware of what has been parsed before, so there is no memory. Context-free grammars offer a structured way to parse strings and give you parse trees as well.

    This is how I did it with Lark (the dates are in German): https://duygua.github.io/blog/2018/03/28/chatbot-nlu-series-datetimeparser/
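
To make the grammar idea concrete without installing anything, here is a minimal sketch of a hand-written recursive-descent parser in plain Python. It is not the Lark implementation from the linked post. The function names (`parse_number`, `parse_year`, `parse_full_date`) are made up for illustration, and two things are assumptions on top of the grammar: both separators must be the same character, and the result is checked against the calendar with `datetime.date`. It also assumes day/month/year order, as in the rules above.

```python
from datetime import date
from typing import Optional, Tuple

DIGITS = "0123456789"


def parse_number(text: str, pos: int, min_len: int, max_len: int) -> Optional[Tuple[int, int]]:
    """Consume min_len..max_len ASCII digits at pos; return (value, new_pos) or None."""
    end = pos
    while end < len(text) and end - pos < max_len and text[end] in DIGITS:
        end += 1
    if end - pos < min_len:
        return None
    return int(text[pos:end]), end


def parse_year(text: str, pos: int) -> Optional[Tuple[int, int]]:
    """year -> 19dd | 20dd"""
    result = parse_number(text, pos, 4, 4)
    if result is None or not 1900 <= result[0] <= 2099:
        return None
    return result


def parse_full_date(text: str) -> Optional[date]:
    """full date -> day SEP month SEP year, where SEP is '/' or '.'."""
    day = parse_number(text, 0, 1, 2)           # day -> one or two digits
    if day is None:
        return None
    day_value, pos = day

    if pos >= len(text) or text[pos] not in "/.":
        return None
    separator = text[pos]                       # remembered for the second separator

    month = parse_number(text, pos + 1, 1, 2)   # month -> one or two digits
    if month is None:
        return None
    month_value, pos = month

    if pos >= len(text) or text[pos] != separator:
        return None

    year = parse_year(text, pos + 1)
    if year is None:
        return None
    year_value, pos = year

    if pos != len(text):                        # reject trailing characters
        return None

    try:
        # Extra semantic check (an assumption, not part of the grammar):
        # the numbers must form a real calendar date.
        return date(year_value, month_value, day_value)
    except ValueError:
        return None


if __name__ == "__main__":
    for candidate in ["08/10/1975", "1/10/1980", "2/20/1961", "24-2215", "192.168.1.1"]:
        print(candidate, "->", parse_full_date(candidate))
```

With these rules, fragments such as "24-2215" or "192.168.1.1" are rejected because they do not have the shape of a date. Note that "2/20/1961" is rejected too, since the sketch reads it day-first and there is no month 20; month/day/year input would need an additional rule.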