Search code examples
pythondatetimeparsingfeature-engineeringdateparser

How to use dateparser to detect dates in strings?


I want to use dateparser to detect which cell contains a date. I have a broad range of different date formats: Fr, 21.02.2020 // 20.02.2020 // 21.02 // 21-02-2020 // January, 21 2020 // 21-Jan-2020 // 21/02/20 and I am sure there will still come a couple more in the future. The library dateparser is able to detect all of them pretty well, though it also detects 'PO', 'to','06','16:00' as date or relative date, which I don't want. I tried to check the Documentation and turn the relative date off or to look how to change to only detect "real dates". In the settings they offer different PARSERS and the possibility to only use some of them. These are the default PARSERS and the program runs through all of them:

'timestamp': If the input string starts with 10 digits, optionally followed by additional digits or a period (.), those first 10 digits are interpreted as Unix time.

'relative-time': Parses dates and times expressed in relation to the current date and time (e.g. “1 day ago”, “in 2 weeks”).

'custom-formats': Parses dates that match one of the date formats in the list of the date_formats parameter of dateparser.parse() or DateDataParser.get_date_data.

'absolute-time': Parses dates and times expressed in absolute form (e.g. “May 4th”, “1991-05-17”). It takes into account settings such as DATE_ORDER or PREFER_LOCALE_DATE_ORDER.

'base-formats': Parses dates that match one of the following date formats

I tried to only use one of them with the part settings={'base-formats':True}) in my code, nonetheless it won't work. Furthermore they offer the following snippet to turn of individual PARSERS:

>>> from dateparser.settings import default_parsers
>>> parsers = [parser for parser in default_parsers if parser != 'relative-time']
>>> parse('today', settings={'PARSERS': parsers})

Here pops up the error:

ModuleNotFoundError: No module named 'dateparser.settings'

I tried pip install, won't work.

Link to docu: https://dateparser.readthedocs.io/en/latest/#settings

And here's my code:

import dateparser

inputlist = [[' ','Supplier:',' Company Y', ' ', 'Project:','Carasco', ' '],[' ','21-Jan-2020',' ','Consultant:','James Farewell', ' ', ' '],['PO', ' Service', ' Cost Center', ' Accounting Object', ' deliver at', ' Amount', ' Unit'],['0106776','XYZ', 'Countable',' ', '16:00','6,00','h',],['Fr, 21.02.2020', '20.03.2020', ' ', ' ', ' ', ' ','6/04/20']]
print(inputlist)

outerlist=[]
for row in inputlist:
    innerlist = []
    for cell in row:
        parsecheck = dateparser.parse(cell,  languages=['en', 'de'], settings={'base-formats':True})
        if parsecheck == None:
            innerlist.append(0)
        else:
            innerlist.append(1)

    outerlist.append(innerlist)

print(outerlist)

I currently get:

[0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0, 1]]

Desired Output:

[0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0,0, 0, 0], [1, 1, 0, 0, 0, 0, 1]]

Solution

  • This is the best I could do:

    import dateparser
    import locale
    
    inputlist = [[' ','Supplier:',' Company Y', ' ', 'Project:','Carasco', ' '],[' ','21-Jan-2020',' ','Consultant:','James Farewell', ' ', ' '],['PO', ' Service', ' Cost Center', ' Accounting Object', ' deliver at', ' Amount', ' Unit'],['0106776','XYZ', 'Countable',' ', '16:00','6,00','h',],['Fr, 21.02.2020', '20.03.2020', ' ', ' ', ' ', ' ','6/04/20']]
    print(inputlist)
    
    customlist = ["%d.%m.%Y", "%d-%b-%Y", "%w/%m/%y", "%a, %d.%m.%Y"]
    
    outerlist=[]
    saved = locale.setlocale(locale.LC_ALL)
    locale.setlocale(locale.LC_ALL, 'de_de')
    for row in inputlist:
        innerlist = []
        for cell in row:
            parsecheck = dateparser.parse(cell,  languages=['en', 'de'], settings={'PARSERS':['custom-formats']}, date_formats=customlist)
            if parsecheck == None:
                innerlist.append(0)
            else:
                innerlist.append(1)
        outerlist.append(innerlist)
    
    locale.setlocale(locale.LC_ALL, saved)
    print(outerlist)
    

    The output is:

    [[0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 1]]
    

    For parsing Fr, 21.02.2020 I changed the locale to Germany and, near the end I go back to your initial locale.

    The format was based on documentation of strftime() and strptime() Behavior