
How to detect and extract dates in different formats from OCR-generated text files in Python


I am building a date extractor for images using Python.

After reading the images and converting them to .txt files, I have a set of text files containing dates in different formats, such as:

20-april-2019

20-04-2019

20-4-19

apr-20-2019

Apr-20-19

20Apr-2019

and so on

I want to identify and extract the dates from text like the above. Any ideas on how to do this?


Solution

  • You can use the dateparser module:

    import dateparser
    print(dateparser.parse('20Apr-2019'))
    

    Gives:

    2019-04-20 00:00:00
    

    dateparser.parse returns a datetime object.

    If your text file contains other strings, and the task is to identify the dates as well as extract them, you can use search_dates from dateparser.search:

    from dateparser.search import search_dates
    str1 = "Whurat UDAYA FILLING STATION MATTUPATTY ROAD MUNNAR 04865230318 ORIGINAL De DD De Da ED eH DAC Da a Da Oa DC Oa DO Dt Oe 29-MAY -2019 14:02:23 INVOICE NO: 292 i VEHICLE NO: NOT ENTERED (| NOZZLE NO : 1 f PRODUCT: PETROL RATE : 75.01 INR/Ltr VOLUME: 1.33 Ltr AMOUNT: 100.00 INR ek DA DH DE DC DE DRC DC DDC cok DE DC CDC DC DE DE S.T..No : 27430268741C M.S.T. No: 27430268741V pe TE ETA CT a DD OC DRE I BOC IE DOC Thank You! Visit Again"
    print(search_dates(str1))
    

    Which gives:

    [('04865230318', datetime.datetime(1985, 6, 2, 2, 17, 11)), ('29-MAY -2019 14:02:23', datetime.datetime(2019, 5, 29, 14, 2, 23)), ('292', datetime.datetime(1900, 1, 1, 2, 9, 2)), ('100', datetime.datetime(1900, 1, 1, 1, 0)), ('ek', datetime.datetime(1900, 10, 1, 0, 0)), ('TE', datetime.datetime(1900, 7, 1, 0, 0)), ('OC', datetime.datetime(1900, 1, 1, 0, 0))]
    

    As you can see, the result needs further filtering to eliminate false positives (digit strings such as '04865230318' and '292', and fragments such as 'ek', are parsed as dates), but it should catch most of the dates you throw at it.
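
One possible way to do that filtering is a small heuristic applied to the matched substrings. The sketch below is only an illustration, not part of dateparser: `looks_like_date` and `filter_matches` are hypothetical helper names, and the rule (the match must contain a digit plus either a month name or a `-` or `/` separator) is an assumption that will need tuning for your data. It uses only the standard library and takes the list printed above as input.

```python
import datetime
import re

# Month abbreviations; these also match full names such as "april" or "MAY".
MONTH_RE = re.compile(r"jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec", re.IGNORECASE)
# A digit, then a "-" or "/" separator, then a letter or digit, e.g. "20-4" or "20/apr".
SEPARATOR_RE = re.compile(r"\d\s*[-/]\s*[A-Za-z0-9]")


def looks_like_date(text):
    """Heuristic (an assumption, not a dateparser feature): a digit plus a month name or separator."""
    has_digit = any(ch.isdigit() for ch in text)
    return has_digit and bool(MONTH_RE.search(text) or SEPARATOR_RE.search(text))


def filter_matches(matches):
    """Keep only the (text, datetime) pairs whose matched text looks like a date."""
    return [(text, parsed) for text, parsed in matches if looks_like_date(text)]


# The list returned by search_dates(str1) in the answer above.
matches = [
    ("04865230318", datetime.datetime(1985, 6, 2, 2, 17, 11)),
    ("29-MAY -2019 14:02:23", datetime.datetime(2019, 5, 29, 14, 2, 23)),
    ("292", datetime.datetime(1900, 1, 1, 2, 9, 2)),
    ("100", datetime.datetime(1900, 1, 1, 1, 0)),
    ("ek", datetime.datetime(1900, 10, 1, 0, 0)),
    ("TE", datetime.datetime(1900, 7, 1, 0, 0)),
    ("OC", datetime.datetime(1900, 1, 1, 0, 0)),
]

filtered = filter_matches(matches)
print(filtered)
```

For this input only the '29-MAY -2019 14:02:23' entry survives. The heuristic is deliberately strict, so a date written with no separator and no month name (for example 20042019) would be dropped; loosen the rule if your OCR output contains such forms.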