Tags: python, datetime, parsing, python-dateutil

dateutil.parser strange result when parsing date in the format "%Y:%m:%d"


I love dateutil.parser and use it often so I don't have to worry about inconsistent datetime formats and in most cases it has been very reliable for me.

But today I spent an hour debugging, trying to understand why the dates I'm writing to the database are all from 2022. The date format is: 2009:05:03 08:12:37, as stored in photo metadata and extracted using the exif package.

It feels like it should be a reasonably straightforward format to parse but this is what I get with dateutil and datetime modules, respectively:

from dateutil.parser import parse
import datetime as dt

date_string = '2009:05:03 08:12:37'

wrong = parse(date_string, fuzzy=True)
correct = dt.datetime.strptime(date_string, "%Y:%m:%d %H:%M:%S")

print(wrong)
print(correct)

Out:

2022-07-25 08:12:37
2009-05-03 08:12:37

Both normal and fuzzy parsing give the same result. It's strange; I don't even understand how it could come up with this exact result.

It looks like it parsed the time successfully but failed to parse the date, and filled in the current date by default. This feels like very dangerous behaviour. It should raise an exception.


Solution

  • The documentation for the default parameter of dateutil.parser.parse says:

    The default datetime object, if this is a datetime object and not None, elements specified in timestr replace elements in the default object.

    This is easy to see in the parser.parse() code on GitHub:

            if default is None:
                default = datetime.datetime.now().replace(hour=0, minute=0,
                                                          second=0, microsecond=0)
    

    The behavior being documented does not make it obvious, though; I agree with you that the result is surprising.
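To make the role of default concrete, here is a minimal sketch. The sentinel date 2000-01-01 is an arbitrary choice for illustration; any datetime works. Fields missing from the string are copied from default instead of from today's date.

```python
from datetime import datetime
from dateutil.parser import parse

# Arbitrary sentinel chosen for illustration only.
sentinel = datetime(2000, 1, 1)

# The string carries only a time, so year, month and day come from `sentinel`.
time_only = parse("08:12:37", default=sentinel)

# The string carries a full date and time, so nothing is taken from `sentinel`.
full = parse("2009-05-03 08:12:37", default=sentinel)

print(time_only)
print(full)
```

Without an explicit default, the same time-only call would fill the date in from datetime.now(), which is where the 2022 dates in the question come from.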

    As for why it failed: later on, parse() calls _timelex.split(timestr), which ultimately calls _timelex.get_token(). This is a state machine implemented in pure Python to tokenize time strings, and it correctly returns the tokens ['2009', ':', '05', ':', '03', ' ', '08', ':', '12', ':', '37'].
    Then _parse() iterates over the tokens and tries to interpret them, calling _parse_numeric_token() on each numeric one. This function tries to match the token against specific cases. There are some cases in which "ymd" (year/month/day) gets set (which is what you would like):

    • YYMMDD (first token without any space nor dot (.))
    • YYYYMMDD[HHMM[ss]] (first token without space, including time)
    • token followed by one of -/.
    • last token or followed by any of .,;-/' (or any word of at/on/and/ad/m/t/of/st/nd/rd/th)
    • token could be a day

    But instead it matches the case "followed by :", which gets interpreted as HH:MM:SS and produces _result(hour=2009, minute=5, second=3, microsecond=0). This consumes the first five tokens (the whole expected YMD). After that, it skips the whitespace, goes back into _parse_numeric_token(), matches the exact same case and overwrites the HMS, giving _result(hour=8, minute=12, second=37, microsecond=0).
    It has not found any YMD, so ymd.resolve_ymd() has nothing to do.

    Back up in parse() (without the leading underscore), it builds a naive datetime (no timezone) by replacing the fields of the default with the fields found by _parse(), and that becomes the final result.
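Since the missing date is silently filled in from default, one way to detect the problem is to parse the same string against two different defaults and compare. This is only a sketch: parse_or_raise is a hypothetical helper, not part of dateutil, and the two sentinel dates are arbitrary. Note that it also rejects strings that legitimately omit a field (for example a bare time).

```python
from datetime import datetime
from dateutil.parser import parse

# Two arbitrary defaults that differ in every date and time field.
_DEFAULT_A = datetime(2000, 1, 1, 1, 1, 1, 1)
_DEFAULT_B = datetime(2001, 2, 2, 2, 2, 2, 2)


def parse_or_raise(timestr):
    """Hypothetical helper: parse timestr, refusing results that rely on the default.

    Sub-second precision is ignored, since most strings do not carry it.
    """
    first = parse(timestr, default=_DEFAULT_A).replace(microsecond=0)
    second = parse(timestr, default=_DEFAULT_B).replace(microsecond=0)
    if first != second:
        # Some field was taken from the default rather than from the string.
        raise ValueError(f"incomplete or misparsed datetime: {timestr!r}")
    return first


try:
    parse_or_raise("2009:05:03 08:12:37")
    exif_rejected = False
except ValueError:
    exif_rejected = True

print(exif_rejected)
print(parse_or_raise("2009-05-03 08:12:37"))
```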

    I think this may be cause for opening an issue on GitHub, but I fear it may be considered "Won't fix" because the data you provided is strange: time parts are usually delimited by colons, while date parts are (basically) never delimited by them (cf. ISO 8601). I had never seen this datetime format before.

    I recommend instead that you either fix the code that is producing this malformed data, or add something like if malformed := re.match(r"(\d+):(\d+):(\d+) (\d+:\d+:\d+)", date_string): date_string = f"{malformed.group(1)}-{malformed.group(2)}-{malformed.group(3)} {malformed.group(4)}" to rewrite the data into a format the dateutil parser expects.
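The reformatting step above can be sketched as a small standalone function. normalize_exif_datetime is a hypothetical name, and the pattern assumes the exact "YYYY:MM:DD HH:MM:SS" layout shown in the question; anything else is returned unchanged.

```python
import re

# Matches only the colon-delimited layout from the question.
_EXIF_DATETIME = re.compile(r"(\d{4}):(\d{2}):(\d{2}) (\d{2}:\d{2}:\d{2})")


def normalize_exif_datetime(date_string):
    """Hypothetical helper: turn 'YYYY:MM:DD HH:MM:SS' into 'YYYY-MM-DD HH:MM:SS'."""
    match = _EXIF_DATETIME.fullmatch(date_string)
    if match is None:
        # Not the colon-delimited layout: leave the string untouched.
        return date_string
    year, month, day, time_part = match.groups()
    return f"{year}-{month}-{day} {time_part}"


print(normalize_exif_datetime("2009:05:03 08:12:37"))
```

The normalized string can then be handed to dateutil's parse() like any other input.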


    About fuzzy:

    Whether to allow fuzzy parsing, allowing for string like “Today is January 1, 2047 at 8:21:00AM”.

    What it actually does happens in the _parse() function: when nothing matches the current token, a ValueError is normally raised, but with fuzzy the token is skipped instead. So for any input that parses without fuzzy, the fuzzy result will be the same.
    In this case it changes nothing (but it may be useful for other date strings you receive, I can't tell).
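A short sketch of that difference, using the sentence from the documentation quoted above. The fixed default is an arbitrary value added here so that both calls on the EXIF string are directly comparable.

```python
from datetime import datetime
from dateutil.parser import parse

sentence = "Today is January 1, 2047 at 8:21:00AM"

# Without fuzzy, the unknown words make the parser give up.
try:
    parse(sentence)
    strict_failed = False
except ValueError:
    strict_failed = True

# With fuzzy, unknown tokens are skipped and the rest is parsed.
fuzzy_result = parse(sentence, fuzzy=True)

# For the string from the question, fuzzy makes no difference.
exif = "2009:05:03 08:12:37"
fixed_default = datetime(2000, 1, 1)  # arbitrary, for a reproducible comparison
same_result = parse(exif, default=fixed_default) == parse(
    exif, fuzzy=True, default=fixed_default
)

print(strict_failed, fuzzy_result, same_result)
```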