
Return directives used by dateutil.parser


Is there a way to get back the directives that dateutil used to parse a date?

from dateutil import parser

dstr = '2017/10/01 16:44'
dtime = parser.parse(dstr)

What I would like is the ability to get '%Y/%m/%d %H:%M' back somehow.


Solution

  • No, the parser in dateutil has no support for extracting a format. The parser uses a mix of tokenizing and heuristics to work out what the various numbers and words in the input could mean, and no 'format' is built up during this process.

    Your best bet is to search the input string for the fields from the resulting datetime object and produce a format from that.

    For your specific example, that is a reasonable option, because all the resulting values are unique. If your inputs do not have unique values, you'll have to include heuristics that use multiple examples to increase the certainty of a correct match.

    In your example, you can find unique positions for all the datetime components presented as strings, starting with '2017', '10', etc. For other inputs, however, you'll have to search for different string representations of those components, such as a two-digit year, or month, day, hour or minute values without zero-padding, and you need to account for a 12-hour clock representation.
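
    To make the idea of "different variants" concrete, here is a minimal sketch that lists the string forms each component of a datetime could take, together with the directives that could explain them. The helper name candidate_tokens is hypothetical, it uses only the standard library, and it assumes an English/C locale for the name and AM/PM directives. Unpadded values are mapped to the ordinary padded directives because strptime accepts unpadded numbers for those.

```python
from collections import defaultdict
from datetime import datetime


def candidate_tokens(dtime):
    """Map each plausible string form of a component to the directives that could explain it.

    Hypothetical helper for illustration; not part of dateutil.
    """
    candidates = defaultdict(list)

    # Zero-padded numbers and names, exactly as strftime renders them.
    for directive in ("%Y", "%y", "%m", "%d", "%H", "%I", "%M", "%S",
                      "%p", "%B", "%b", "%A", "%a"):
        candidates[dtime.strftime(directive)].append(directive)

    # Unpadded numeric forms; strptime's %m, %d, %H, %I, %M and %S accept these too.
    unpadded = {
        "%m": dtime.month,
        "%d": dtime.day,
        "%H": dtime.hour,
        "%I": dtime.hour % 12 or 12,  # 12-hour clock value
        "%M": dtime.minute,
        "%S": dtime.second,
    }
    for directive, value in unpadded.items():
        text = str(value)
        if directive not in candidates[text]:
            candidates[text].append(directive)

    return dict(candidates)


tokens = candidate_tokens(datetime(2017, 10, 1, 16, 44))
print(tokens["2017"], tokens["17"], tokens["1"], tokens["4"])
```

    Any string that maps to more than one directive (for example '10' in a date like 2017-10-10 10:10) marks an ambiguity that needs extra examples or priorities to resolve.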

    I haven't directly tried this, but I strongly suspect that this is a problem very suitable for the Aho–Corasick algorithm, which lets you find positions of matching known strings (the dictionary, here your various datetime components formatted as strings, plus potential delimiter characters) in an input string. Once you have those positions, and you have resolved the ambiguities, you can construct a format string from those. You can probably narrow down the number of possible component formats by looking for tell-tale strings like pm or weekdays or month names.

    There are ready-made Python implementations, like the pyahocorasick package. With that library I was able to make a pretty good approximation in a few steps:

    >>> from dateutil import parser
    >>> import ahocorasick
    >>> A = ahocorasick.Automaton()
    >>> dstr = '2017/10/01 16:44'
    >>> dtime = parser.parse(dstr)
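    >>> formats = 'dmyYHIpMS'
    >>> # register each datetime component, rendered as a string, under its directive letter
    >>> for f in formats:
    ...     _ = A.add_word(dtime.strftime(f'%{f}'), (False, f))
    ...
    >>> # register the delimiter characters too, flagged as punctuation
    >>> for p in ':/ ':
    ...     _ = A.add_word(p, (True, p))
    ...
    >>> A.make_automaton()
    >>> # report every match: the index of its last character and what it stands for
    >>> for end_index, (punctuation, char) in A.iter(dstr):
    ...     print(end_index, char if punctuation else f'%{char}')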
    ...
    2 %d
    3 %Y
    3 %y
    4 /
    6 %m
    7 /
    9 %d
    10
    12 %H
    13 :
    15 %M
    

    You could include priorities, and only output a specific formatter when punctuation is reached; that'll resolve the %d / %Y / %y clash at the start.
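
    As a rough sketch of that priority idea, the following uses a plain left-to-right scan with longest-match-first instead of the Aho-Corasick automaton, and only the standard library. The function name guess_format and the priority order in DIRECTIVES are assumptions made for illustration. It still cannot tell apart components that render to the same string, so inputs like '10/10/2017 10:10' need the multi-example heuristics mentioned earlier.

```python
from datetime import datetime

# Priority order (an assumption): when two directives render identically, the earlier one wins.
DIRECTIVES = ["%Y", "%m", "%d", "%H", "%M", "%S", "%y", "%I", "%p"]


def guess_format(dstr, dtime):
    """Rebuild a strptime-style format by longest-match scanning (hypothetical helper)."""
    # Map each rendered component to the first directive that produces it.
    tokens = {}
    for directive in DIRECTIVES:
        tokens.setdefault(dtime.strftime(directive), directive)

    # Try longer strings first, so '2017' wins over the '01' and '17' hidden inside it.
    ordered = sorted(tokens, key=len, reverse=True)

    out, i = [], 0
    while i < len(dstr):
        for text in ordered:
            if text and dstr.startswith(text, i):
                out.append(tokens[text])
                i += len(text)
                break
        else:
            # No component starts here: keep the character as a literal.
            out.append("%%" if dstr[i] == "%" else dstr[i])
            i += 1
    return "".join(out)


dstr = "2017/10/01 16:44"
dtime = datetime(2017, 10, 1, 16, 44)  # stands in for parser.parse(dstr)
fmt = guess_format(dstr, dtime)
print(fmt)  # %Y/%m/%d %H:%M
```

    A cheap sanity check is to round-trip the result: datetime.strptime(dstr, fmt) should give back the same datetime that the parser produced.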