Tags: python, validation, date, python-dateutil

dateutil.parser: how to deal with dd/mm and mm/dd in same column?


I am parsing CSV files in which one of the columns holds a datetime. These CSVs are badly formed: some files have 28-02-2018T00:00:00.000+1000 and other files have 2018-02-28T00:00:00.000+1000.

If I do:

dateutil.parser.parse(my_csv["timestamp"])

Whether my_csv["timestamp"] is 28-02-2018T00:00:00.000+1000 or 2018-02-28T00:00:00.000+1000 is irrelevant. The result will be correct even if I don't specify a format, because the library recognises that there is no "month 28" and chooses the correct format on its own.

But how to deal with cases where the day and month are valid for both format slots?

2018-02-04 and 04-02-2018 are both the same date, but one has %m-%d and the other has %d-%m.

Can I tell the parser the format will be either %Y-%m-%d OR %d-%m-%Y?

Is there an additional parameter I can pass when I parse, to tell the parser: if the four-digit year (%Y) shows up first, use %Y-%m-%d, otherwise use %d-%m-%Y?


Solution

  • Currently the customization options for the dateutil parser are minimal and there is no way to specify what you want.

    However, if it is just these two formats, I recommend not using dateutil's parser at all. You can parse these dates with a function that tries one format and then falls back to the other:

    from datetime import datetime
    
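    def parse_myformats(dtstr):
        try:
            # Year-first (ISO 8601 style): 2018-02-04T00:00:00.000+1000
            return datetime.strptime(dtstr, '%Y-%m-%dT%H:%M:%S.%f%z')
        except ValueError:
            # Day-first fallback: 04-02-2018T00:00:00.000+1000
            return datetime.strptime(dtstr, '%d-%m-%YT%H:%M:%S.%f%z')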
    

    This assumes Python 3, because strptime does not support the %z directive in Python 2. In Python 2 you will have to strip off the last 5 characters (the +1000 offset) and parse the time zone separately.
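
    As a rough illustration of that "strip and parse separately" approach, here is a minimal sketch that uses only the standard library. The names FixedOffset, parse_with_manual_offset and parse_myformats_py2 are hypothetical helpers made up for this example, and the sketch assumes the string always ends in a sign followed by four digits (such as +1000):

```python
from datetime import datetime, timedelta, tzinfo


class FixedOffset(tzinfo):
    """Minimal fixed-offset tzinfo (hypothetical helper for this sketch)."""

    def __init__(self, minutes):
        self._offset = timedelta(minutes=minutes)

    def utcoffset(self, dt):
        return self._offset

    def dst(self, dt):
        return timedelta(0)

    def tzname(self, dt):
        return None


def parse_with_manual_offset(dtstr, fmt):
    # Assumption: the last 5 characters are the UTC offset, e.g. '+1000'.
    naive_part, offset_part = dtstr[:-5], dtstr[-5:]
    sign = -1 if offset_part[0] == '-' else 1
    minutes = sign * (int(offset_part[1:3]) * 60 + int(offset_part[3:5]))
    # Parse the part without the offset, then attach the offset as tzinfo.
    naive = datetime.strptime(naive_part, fmt)
    return naive.replace(tzinfo=FixedOffset(minutes))


def parse_myformats_py2(dtstr):
    try:
        return parse_with_manual_offset(dtstr, '%Y-%m-%dT%H:%M:%S.%f')
    except ValueError:
        return parse_with_manual_offset(dtstr, '%d-%m-%YT%H:%M:%S.%f')
```

    The sketch avoids %z entirely, so it should behave the same way in Python 2 and Python 3. If dateutil is available anyway, its tz.tzoffset class could replace the hand-written FixedOffset.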

    That said, since the first format is an ISO 8601 datetime, you can also use dateutil.parser.isoparse (available since dateutil 2.7.0) as the first attempt, and use parse with dayfirst=True as the fallback:

    from dateutil import parser
    
    def parse_myformats_du(dtstr):
        try:
            # Strict ISO 8601 parser; raises ValueError on the day-first strings
            return parser.isoparse(dtstr)
        except ValueError:
            # General parser, told to read 04-02 as day 4, month 2
            return parser.parse(dtstr, dayfirst=True)
    

    This version works in both Python 2 and 3 with no additional modifications, though it will likely be slower on the branch that calls dateutil.parser.parse. See it in action:

    >>> parse_myformats('2018-02-04T00:00:00.000+1000')
    datetime.datetime(2018, 2, 4, 0, 0, tzinfo=datetime.timezone(datetime.timedelta(0, 36000)))
    
    >>> parse_myformats('04-02-2018T00:00:00.000+1000')
    datetime.datetime(2018, 2, 4, 0, 0, tzinfo=datetime.timezone(datetime.timedelta(0, 36000)))
    
    >>> parse_myformats_du('2018-02-04T00:00:00.000+1000')
    datetime.datetime(2018, 2, 4, 0, 0, tzinfo=tzoffset(None, 36000))
    
    >>> parse_myformats_du('04-02-2018T00:00:00.000+1000')
    datetime.datetime(2018, 2, 4, 0, 0, tzinfo=tzoffset(None, 36000))
    

    If you are concerned with speed, here are the IPython %timeit microbenchmarks for these:

    %timeit parse_myformats('2018-02-04T00:00:00.000+1000')
    31.8 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    %timeit parse_myformats('04-02-2018T00:00:00.000+1000')
    45.1 µs ± 2.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    %timeit parse_myformats_du('2018-02-04T00:00:00.00+1000')
    31.3 µs ± 574 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    %timeit parse_myformats_du('04-02-2018T00:00:00.000+1000')
    191 µs ± 2.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
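
    The question also asked whether the format could be chosen by checking if the four-digit year comes first. A minimal sketch of that idea follows, assuming Python 3 (for %z) and that only the two formats above ever occur. parse_by_prefix, the two format constants and the sample CSV content are hypothetical names and data made up for this example. It picks the format up front instead of relying on a ValueError, and it has not been benchmarked here:

```python
import csv
import io
from datetime import datetime

ISO_FORMAT = '%Y-%m-%dT%H:%M:%S.%f%z'        # 2018-02-04T00:00:00.000+1000
DAY_FIRST_FORMAT = '%d-%m-%YT%H:%M:%S.%f%z'  # 04-02-2018T00:00:00.000+1000


def parse_by_prefix(dtstr):
    # A four-digit year puts the first hyphen at index 4;
    # a two-digit day puts it at index 2.
    fmt = ISO_FORMAT if dtstr[4:5] == '-' else DAY_FIRST_FORMAT
    return datetime.strptime(dtstr, fmt)


# Hypothetical CSV content standing in for the real files.
sample = io.StringIO(
    'id,timestamp\n'
    '1,2018-02-04T00:00:00.000+1000\n'
    '2,04-02-2018T00:00:00.000+1000\n'
)
parsed = [parse_by_prefix(row['timestamp']) for row in csv.DictReader(sample)]
```

    Both rows should come out as the same instant (4 February 2018 at UTC+10:00). A string in any other layout still raises ValueError from strptime, so unexpected formats are not silently misread.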