Let me first share a text:
I am Fox Sin of Greed came on Earth in 1666 BC. due date right after
St. P was build in 16.05.1703 and bluh bluh I moved to Moscow Feb
2nd, 2022 to work as per deadline And today I read manga Due date for
my project is September 12, 2022 I wonder if Ill be able to pay by Oct
07, 2023 and so The deadline is unknown by I assume would be 9102023
Bluh bluh Due Date 12-11-2022 30/08/2021 and 9/19/23
This is randomly generated text to test dateparser and regex. I wrote a function that is pretty good at recognising dates with regex, except for those in the format [month as letters] [day as number], [year as number]. For those I usually use dateparser, as it is capable of recognising them. However, when there are 'trigger words' such as 'may', 'to pay' (??) and the like, it fails. Example:
I moved to Moscow Feb 2nd, 2022 to work as per deadline
[('to', datetime.datetime(2022, 9, 8, 0, 0)), ('Feb 2nd, 2022 to', datetime.datetime(2022, 2, 2, 0, 0))]
This is good. It recognised 'Feb 2nd, 2022', even though it appended 'to' to the match.
But the next one:
I wonder if Ill be able to pay by Oct 07, 2023 and so
[('to pay', datetime.datetime(2022, 9, 8, 0, 0)), ('07, 2023', datetime.datetime(2023, 7, 8, 0, 0))]
It failed to connect 'Oct' to '07, 2023', so the month in the result is wrong.
This is used for extracting data from invoices, and I have no control over the formats in which dates arrive, so I was wondering whether more experienced dateparser users (or users of other Python tools) can help me avoid this problem. Right now it seems to me that I need to avoid words such as 'may', 'to pay', 'now', etc.
If you know the language of the target text, you can provide it, which should prevent problems caused by a bad language guess. After specifying the language en,
I get one date, as expected:
from dateparser.search import search_dates
print(search_dates('I wonder if Ill be able to pay by Oct 07, 2023 and so', languages=['en']))
gives output
[('by Oct 07, 2023 and', datetime.datetime(2023, 10, 7, 0, 0))]
Nonetheless, the docs warn that
Warning Support for searching dates is really limited and needs a lot of improvement
so you should be prepared that you might still get results not as desired.
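If the [month as letters] [day as number], [year as number] format is the main gap, one option is to catch it with a dedicated regex first and only fall back to search_dates for whatever remains. The following is a minimal sketch using only the standard library; the names MONTHS, MONTH_DATE_RE and find_month_name_dates are hypothetical and not part of dateparser, and the pattern assumes English month names and a four-digit year.

```python
import re
from datetime import datetime

# English month abbreviations mapped to month numbers (assumption: English text).
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

# Matches e.g. "Feb 2nd, 2022", "Oct 07, 2023", "September 12, 2022".
# The month name must be followed by a day number, so a bare "may" is ignored.
MONTH_DATE_RE = re.compile(
    r"\b(?P<month>jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?"
    r"|july?|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?"
    r"|dec(?:ember)?)\.?\s+"
    r"(?P<day>\d{1,2})(?:st|nd|rd|th)?,?\s+"
    r"(?P<year>\d{4})\b",
    re.IGNORECASE,
)


def find_month_name_dates(text):
    """Return (matched_text, datetime) pairs for 'Month DD, YYYY' style dates."""
    results = []
    for match in MONTH_DATE_RE.finditer(text):
        month = MONTHS[match.group("month")[:3].lower()]
        try:
            value = datetime(int(match.group("year")), month, int(match.group("day")))
        except ValueError:
            # Skip impossible dates such as "Feb 30, 2022".
            continue
        results.append((match.group(0), value))
    return results


print(find_month_name_dates("I wonder if Ill be able to pay by Oct 07, 2023 and so"))
```

Because the pattern requires a day and a year right after the month name, words like 'may' or 'to pay' on their own never produce a match. Text that this sketch does not cover can still be passed to search_dates with languages=['en'].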