Separating two datetime values from a single string


I need to write a method to take in a string containing two datetime values, and separate out the values. These datetime values can be in any valid ISO-8601 format which means I can't just split on a character index. The values will be separated by a hyphen which also means I can't just use str.split() either.

I've written this function using a regex, but the client has asked me to use python-dateutil instead.

import re
from datetime import datetime


def split_range(times):
    # Match each ISO 8601-like date or datetime found in the input string
    regex = re.compile(r"[0-9]{4}-?[0-9]{2}-?[0-9]{2}([T]([0-9]{2}:?){2,3}(\.[0-9]{3})?)?Z?")
    final_times = []

    for match in regex.finditer(times):
        # Parse the matched text, then normalise it back to ISO format
        datetime_value = datetime.fromisoformat(match.group(0))
        final_times.append(datetime_value.isoformat())

    return final_times

This function should take in a string like this: (these are all the strings I use in my tests)

20080809-20080815

2008-08-08-2008-08-09

2008-08-08T17:21-2008-08-09T17:31

2008-08-08T17:21-2008-08-09T17:31

2008-08-08T17:21:000-2008-08-09T17:31:000

2008-08-08T17:21:000-2008-08-09T17:310:00

2008-08-08T17:21:000.000-2008-08-09T17:31:000.000

and split it into two separate values

ex. 2019-08-08 & 2019-08-09

The client doesn't really like the use of regex here, and would like me to replace it with using dateutil, but I haven't seen anything that seems like it would do what I need. Is there a dateutil method I can use to accomplish this, and if not, is there another library that does have something?


Solution

  • I think the best thing to do might be to ask your client to change the delimiter from - to something that cannot appear in an ISO 8601 string, such as a space or a tab, and split on that. If you must use - as the delimiter and you must support any valid ISO 8601 string, your best option is to look for the pattern -(--|\d{4}), since every valid ISO 8601 datetime starts either with 4 digits or with --. If you find a dash followed by 4 digits, you have found either a negative time zone offset or the beginning of your next ISO 8601 datetime.

    Additionally, no valid ISO 8601 datetime contains -\d{4}-\d{4}. If a -\d{4} match is a time zone offset rather than the start of the second datetime, it must sit at the end of your first ISO 8601 string, where it is immediately followed by the -\d{4} that begins the second. A negative lookahead that rejects any match followed by another -(--|\d{4}) is therefore enough to skip the offset and land on the real delimiter. Putting it all together:

    import re
    from dateutil.parser import isoparse
    
    
    def parse_iso8601_pairs(isostr):
        # In a string containing two ISO 8601 strings delimited by -, the substring
        # "-\d{4}" is only found at the beginning of the second datetime or the
        # end of *either* datetime. If it is found at the end of the first datetime,
        # it will always be followed by `-\d{4}`, so we can use negative lookahead
        # to find the beginning of the next string.
        #
        # Note: ISO 8601 datetimes can also begin with `--`, but parsing these is
        # not supported yet in dateutil.parser.isoparse, as of version 2.8.0. The
        # regex includes this type of string in order to make at least the splitting
        # method work even if the parsing method doesn't support "missing year"
        # ISO 8601 strings.
        m = re.search(r"-(--|\d{4})(?!-(--|\d{4}))", isostr)
        dt1 = None
        dt2 = None
    
        if m is None:
            raise ValueError("String does not contain two ISO 8601 datetimes "
                             f"delimited by -: {isostr}")
    
        split_on = m.span()[0]
        str1 = isostr[0:split_on]
        str2 = isostr[split_on + 1:]
    
        # You may want to wrap the error handling here with a nicer message
        dt1 = isoparse(str1)
        dt2 = isoparse(str2)
    
        return dt1, dt2
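
To see where the pattern actually splits, here is a small sketch that uses only the standard library and skips the parsing step. The helper name `split_pair` is made up for this illustration, and the third input (with -0400 offsets) is an assumed example rather than one of the question's test strings.

```python
import re

# Same pattern as in parse_iso8601_pairs above.
SPLIT_RE = re.compile(r"-(--|\d{4})(?!-(--|\d{4}))")


def split_pair(isostr):
    """Return the substrings on either side of the delimiting hyphen."""
    m = SPLIT_RE.search(isostr)
    if m is None:
        raise ValueError(f"No delimiter found in: {isostr}")
    # m.start() is the index of the delimiting hyphen itself
    return isostr[:m.start()], isostr[m.start() + 1:]


for s in [
    "20080809-20080815",
    "2008-08-08-2008-08-09",
    "2008-08-08T17:21-0400-2008-08-09T17:31-0400",
]:
    print(split_pair(s))
```

In the third string the first -0400 is rejected by the lookahead because it is followed by -2008, so the split lands on the hyphen before the second datetime.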
    

    As far as I know, this will work for any pair of ISO 8601-compliant strings delimited by -, except the obscure "year missing" format: --MM-?DD. The splitting portion of the code works even for strings like --04-01, but dateutil.parser.isoparse does not currently support that format, so the parse will fail. Perhaps more problematic is that --MMDD is also a valid ISO 8601 format, and it will match -\d{4} and give an erroneous split. If you want to support that format and you have a modified parser that can handle --MMDD, I believe you can write a more complicated regex that handles the --MMDD case (if someone wants to do this, I'll be happy to edit it into the answer). Alternatively, you can simply "guess and check": iterate over candidate matches with re.finditer until you find a split point that yields a valid ISO 8601 datetime on both sides of the delimiter.
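
A minimal sketch of the "guess and check" idea, under a few assumptions: the function name `guess_and_check_split` is made up, every hyphen is treated as a candidate delimiter, and the parser is passed in as a callable that raises ValueError on bad input. It defaults to datetime.datetime.fromisoformat so the sketch runs without third-party packages; passing dateutil.parser.isoparse instead should accept a wider range of formats.

```python
import re
from datetime import datetime


def guess_and_check_split(isostr, parse=datetime.fromisoformat):
    """Try each hyphen as the delimiter; return the first split where both sides parse.

    `parse` is any callable that raises ValueError on invalid input.
    """
    for m in re.finditer("-", isostr):
        left, right = isostr[:m.start()], isostr[m.end():]
        try:
            return parse(left), parse(right)
        except ValueError:
            continue  # not a valid split point, try the next hyphen
    raise ValueError(f"Could not split into two ISO 8601 datetimes: {isostr}")


print(guess_and_check_split("2008-08-08T17:21-2008-08-09T17:31"))
```

This returns the first split that parses on both sides, so if a string happens to have more than one valid split point, the earliest one wins. Which inputs parse also depends on the parser you pass in.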

    Note: This method will also work if you substitute datetime.datetime.fromisoformat for dateutil.parser.isoparse. The difference is that datetime.datetime.fromisoformat parses strings that are mostly a subset of what dateutil.parser.isoparse handles. It is the inverse of datetime.datetime.isoformat and will parse anything that could be created by calling the isoformat method on a datetime object, whereas isoparse is intended to parse anything that is a valid ISO 8601 string. If you know that your datetimes were produced by calling the isoformat() method, then fromisoformat is the better choice of ISO 8601 parser.
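
As a quick illustration of the "inverse of isoformat" point, here is a small round trip using only the standard library. The specific datetime and the -04:00 offset are arbitrary values chosen for the example.

```python
from datetime import datetime, timedelta, timezone

# An aware datetime with a fixed -04:00 offset (arbitrary example value)
dt = datetime(2008, 8, 8, 17, 21, tzinfo=timezone(timedelta(hours=-4)))

s = dt.isoformat()                     # '2008-08-08T17:21:00-04:00'
roundtrip = datetime.fromisoformat(s)  # parses what isoformat() produced

print(s)
print(roundtrip == dt)
```

Anything written by isoformat() can be read back this way, which is why fromisoformat is a good fit when you control how the strings were produced.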