Tags: python, date, python-dateutil

Remove recognized date from string


As input I have several strings containing dates in different formats, such as:

  • "Peter drinks tea at 16:45"
  • "My birthday is on 08-07-1990"
  • "On Sat 11 July I'll be back home"

I use dateutil.parser.parse to recognize the dates in the strings.
In the next step I want to remove the dates from the strings. The result should be:

  • "Peter drinks tea at "
  • "My birthday is on "
  • "On I'll be back home"

Is there a simple way to achieve this?


Solution

  • You can use the fuzzy_with_tokens option to dateutil.parser.parse:

    from dateutil.parser import parse
    
    dtstrs = [
        "Peter drinks tea at 16:45",
        "My birthday is on 08-07-1990",
        "On Sat 11 July I'll be back home",
    ]
    
    out = [
        parse(dtstr, fuzzy_with_tokens=True)
        for dtstr in dtstrs
    ]
    

    Result (this was run on 2018-07-17; fields missing from a string are filled in from the current date):

    [(datetime.datetime(2018, 7, 17, 16, 45), ('Peter drinks tea at ',)),
     (datetime.datetime(1990, 8, 7, 0, 0), ('My birthday is on ',)),
     (datetime.datetime(2018, 7, 11, 0, 0), ('On ', ' ', " I'll be back home"))]
    

    When fuzzy_with_tokens is true, the parser returns a tuple of a datetime and a tuple of the ignored tokens, i.e. the parts of the string that were not used to build the datetime. You can join them back into a string like this:

    >>> ['<missing>'.join(x[1]) for x in out]
    ['Peter drinks tea at ',
     'My birthday is on ',
     "On <missing> <missing> I'll be back home"]
    

    I'll note that the fuzzy parsing logic is not very reliable, because it is hard to pick out only the valid date components from free text. If you change the person drinking tea to someone named April, for example:

    >>> dt, tokens = parse("April drinks tea at 16:45", fuzzy_with_tokens=True)
    >>> print(dt)
    2018-04-17 16:45:00
    >>> print('<missing>'.join(tokens))
     drinks tea at 
    

    Here the name "April" was taken as the month and removed from the string. So I would urge some caution with this approach, though I can't really recommend a better one; this is just a hard problem.
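
To get the exact strings asked for in the question, the ignored tokens can be joined with an empty string and the leftover runs of whitespace collapsed. The sketch below is one possible way to do that. Note that join_tokens is a hypothetical helper name, not part of dateutil, and the token tuples are copied from the result shown in the answer rather than produced by a live parse call.

```python
import re


def join_tokens(tokens):
    """Join ignored tokens and collapse the gaps left by removed date parts.

    `tokens` is assumed to be the second element of the tuple returned by
    dateutil.parser.parse(..., fuzzy_with_tokens=True).
    """
    # Runs of two or more whitespace characters become a single space;
    # a single trailing space (as in "tea at ") is left untouched.
    return re.sub(r"\s{2,}", " ", "".join(tokens))


# Token tuples copied from the result in the answer (assumption: your
# dateutil version splits the strings the same way).
token_sets = [
    ("Peter drinks tea at ",),
    ("My birthday is on ",),
    ("On ", " ", " I'll be back home"),
]

cleaned = [join_tokens(tokens) for tokens in token_sets]
print(cleaned)
```

With the `out` list from the answer, the equivalent call would be `[join_tokens(x[1]) for x in out]`. This only tidies up whatever the fuzzy parser chose to ignore, so the caution about names like April still applies.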