python date text-extraction datefinder

How to correctly extract various Date formats from Text in Python


I need to extract all the dates from a PDF and then determine which of them is the contract date.

As a first step, I want to extract every date from the text I pulled out of the PDF. The dates can appear in various formats; I have tried to include all of those variants in the example below.

I tried the datefinder Python module to extract all the dates. It comes close, but it returns a few garbage dates at the start and does not match the first date correctly.

import datefinder

dateContent = """ Test
I want to apply for leaves August,​ ​11,​ ​2017 I want to apply for leaves Aug, 23, 2017 I want to apply for leaves Aug, 21, 17 
I want to apply for leaves August 20 2017
I want to apply for leaves August 30th, 2017 I want to apply for leaves August 31st 17
I want to apply for leaves 8/26/2017 I want to apply for leaves 8/27/17
I want to apply for leaves 28/8/2017 I want to apply for leaves 29/8/17 I want to apply for leaves 30/08/17
I want to apply for leaves 15 Jan 17 I want to apply for leaves 14 January 17
I want to apply for leaves 13 Jan 2017
I want to apply for leaves Jan 10 17 I want to apply for leaves Jan 11 2017 I want to apply for leaves January 12 2017
"""

matches = datefinder.find_dates(dateContent)

for match in matches:
    print(match)

Response:

2019-08-05 00:00:00
2019-06-11 00:00:00
2017-06-05 00:00:00
2017-08-23 00:00:00
2017-08-21 00:00:00
2017-08-20 00:00:00
2017-08-30 00:00:00
2017-08-31 00:00:00
2017-08-26 00:00:00
2017-08-27 00:00:00
2017-08-28 00:00:00
2017-08-29 00:00:00
2017-08-30 00:00:00
2017-01-15 00:00:00
2017-01-14 00:00:00
2017-01-13 00:00:00
2017-01-10 00:00:00
2017-01-11 00:00:00
2017-01-12 00:00:00

As you can see, the text contains 17 dates, but I am getting 19 results. Counting from the bottom, the last 16 match correctly; the first three are garbage, and the first date (August, 11, 2017) is not matched. Once I can extract these dates correctly, I plan to use some kind of n-gram model to check which date's surrounding context refers to contractual information.

Any help in resolving this would be appreciated.


Solution

  • I resolved the issue. It was an encoding problem: my text content contained invisible zero-width space characters (U+200B).

    dateContent = dateContent.replace(u'\u200b', '')
    

    Replacing \u200b with an empty string fixed the issue. The datefinder module handles the rest and finds all the different date formats.
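
Text extracted from PDFs can carry other invisible characters besides U+200B, so it may help to locate them before removing them. Below is a minimal sketch using only the standard library. The helper names `find_zero_width` and `strip_zero_width` and the chosen set of characters are assumptions for illustration; they are not part of datefinder.

```python
import unicodedata

# Assumed set of invisible characters worth removing (illustrative, not exhaustive):
# zero-width space, zero-width non-joiner, zero-width joiner, word joiner, BOM.
ZERO_WIDTH_CHARS = "\u200b\u200c\u200d\u2060\ufeff"

# str.translate deletes every code point that maps to None.
_DELETE_TABLE = dict.fromkeys(map(ord, ZERO_WIDTH_CHARS))


def find_zero_width(text):
    """Return (index, Unicode name) for each invisible character found in text."""
    return [
        (i, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch in ZERO_WIDTH_CHARS
    ]


def strip_zero_width(text):
    """Return text with the invisible characters listed above removed."""
    return text.translate(_DELETE_TABLE)


# Same character pattern as the first date in the question.
sample = "August,\u200b \u200b11,\u200b \u200b2017"
print(find_zero_width(sample))         # positions and names of the hidden characters
print(repr(strip_zero_width(sample)))  # 'August, 11, 2017'
```

The cleaned string can then be passed to datefinder.find_dates exactly as in the question. Printing `ascii(dateContent)` is another quick way to spot escapes such as `\u200b` that are invisible in normal output.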