I'm working on a sensitive-data recognition (NER) task and I'm running into the problem that I cannot accurately detect dates in texts. I've tried almost everything...
For example, my texts contain dates like these:
date_list = ['23 octbr', '08/10/1975', '2/20/1961', 'December 23', '2021', '1/10/1980', ...]
But I must say there is also a lot of other numerical information in the text, for example IP addresses, street addresses, bank card numbers, etc.
This is an example of how spaCy labels them:
'08/10/1975' -> Entity type: No Entity
'2/20/1961' -> Entity type: DATE
'1/10/1980' -> Entity type: CARDINAL
Or, for example, I have the phone number "(150) 224-2215" and spaCy marks the part "24-2215" as a DATE. The same often happens with addresses and credit card numbers.
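For reference, the labels above come from something like the following (the specific pretrained model, en_core_web_sm, is just an example; my actual pipeline may differ):

import spacy

# Check which entity type spaCy assigns to each string in isolation.
nlp = spacy.load("en_core_web_sm")

for text in ["08/10/1975", "2/20/1961", "1/10/1980", "(150) 224-2215"]:
    doc = nlp(text)
    print(text, [(ent.text, ent.label_) for ent in doc.ents])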
I have also tried datefinder and dateparser.search, but they detected completely wrong parts of the sentence, or parts that merely contained the word "to".
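Roughly, I call both libraries like this (the sample sentence is made up for illustration):

import datefinder
from dateparser.search import search_dates

text = "Please send the card statement to John before December 23."

# datefinder yields (source_text, datetime) pairs when source=True.
print(list(datefinder.find_dates(text, source=True)))
# dateparser.search.search_dates returns (matched_text, datetime) tuples.
print(search_dates(text))
# Both tend to pick up spans around "to" or other stray tokens as dates.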
Can you please share your experience with what has worked better for you? What is the best way to get high-accuracy date detection?
What does your corpus include? Does it contain full sentences?
First of all, you can try spaCy NER with context; NER algorithms work on full sentences, so they do much better when you feed them whole sentences rather than isolated strings.
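A minimal sketch of what I mean (assuming the en_core_web_sm model is installed; the sentence is made up for illustration):

import spacy

# Run NER over a full sentence instead of a bare date string.
nlp = spacy.load("en_core_web_sm")

doc = nlp("She was born on 2/20/1961 and moved to Berlin on December 23.")
for ent in doc.ents:
    print(ent.text, ent.label_)

# With surrounding context, date-like spans have a better chance of being
# labeled DATE than when each string is passed to the pipeline on its own.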
If you are looking for a more token/shape-oriented solution, I suggest context-free parsing. A context-free grammar is great for describing dates. Basically, you define grammar rules such as:
date          -> full_date | year
year          -> 19\d{2} | 20\d{2}
full_date     -> day/month/year | day.month.year
day           -> digit_num | two_digit_num
month         -> digit_num | two_digit_num
two_digit_num -> digit_num digit_num
digit_num     -> 0 | 1 | 2 | ... | 9
Regex alone is not a good idea here, because it has no "context", i.e. the parsed characters are not aware of what has been parsed before; there is no memory. Context-free grammars offer a structured way to parse strings, and they give you parse trees as well.
This is how I did it with Lark (the dates there are in German): https://duygua.github.io/blog/2018/03/28/chatbot-nlu-series-datetimeparser/
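Here is a minimal English-flavoured sketch of the rules above in Lark (the rule names and patterns are illustrative, not a complete date grammar):

from lark import Lark

# Context-free grammar for simple numeric dates and bare years.
date_grammar = r"""
    start: full_date | year
    full_date: day "/" month "/" year
             | day "." month "." year
    year: /19\d{2}/ | /20\d{2}/
    day: /\d{1,2}/
    month: /\d{1,2}/
"""

parser = Lark(date_grammar)
print(parser.parse("08/10/1975").pretty())  # parsed as full_date
print(parser.parse("2021").pretty())        # parsed as a bare year

The parser returns a tree, so you know which component is the day, the month, and the year, instead of getting just a flat regex match.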