python-3.x datetime parsing python-dateutil

How to get only the date string from a long string


I know there are lots of Q&As on extracting a datetime from a string, for example with dateutil.parser:

import dateutil.parser as dparser
dparser.parse('something sep 28 2017 something',fuzzy=True).date()

output: datetime.date(2017, 9, 28)

But my question is how to find out which part of the string produced this result. For example, I want a function that also returns 'sep 28 2017':

datetime, datetime_str = get_date_str('something sep 28 2017 something')
outputs: datetime.date(2017, 9, 28), 'sep 28 2017'

Any clue, or any direction I could search in?


Solution

  • Extending the discussion with @Paul and following the solution from @alecxe, I propose the following solution, which works on a number of test cases. I've made the problem slightly more challenging:

    Step 1: get excluded tokens

    import dateutil.parser as dparser
    
    ostr = 'something sep 28 2017 something abcd'
    # fuzzy_with_tokens=True also returns the parts of the string the parser ignored
    _, excl_str = dparser.parse(ostr, fuzzy_with_tokens=True)
    

    which gives:

    excl_str:     ('something ', ' ', 'something abcd')
    

    Step 2: rank tokens by length, longest first

    excl_str = list(excl_str)
    excl_str.sort(reverse=True, key=len)
    

    gives a sorted token list:

    excl_str:   ['something abcd', 'something ', ' ']
    

    Step 3: delete the tokens, ignoring the single-space element

    for i in excl_str:
        if i != ' ':  # skip the bare space, which also separates the date parts
            ostr = ostr.replace(i, '')
    

    which gives the final output:

    ostr:    'sep 28 2017 '
    

    Note: step 2 is required because deleting in the original order causes a problem when a shorter token is a substring of a longer one. In this case, if deletion follows the order ('something ', ' ', 'something abcd'), the first replacement removes 'something ' from 'something abcd', so 'abcd' never gets deleted and the result is 'sep 28 2017 abcd'.
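
The three steps can be folded into one function shaped like the get_date_str asked for in the question. This is only a minimal sketch under some assumptions: strip_excluded_tokens and its longest_first flag are names made up here for illustration, the result is stripped of surrounding whitespace, and the hard-coded tokens are the ones reported in Step 1 rather than a fresh parser run.

```python
def strip_excluded_tokens(text, excluded, longest_first=True):
    """Remove the parser's ignored tokens from text and return what is left.

    Hypothetical helper: longest_first=True applies Step 2 (sort by length),
    longest_first=False deletes in the order the tokens were given.
    """
    if longest_first:
        tokens = sorted(excluded, key=len, reverse=True)
    else:
        tokens = list(excluded)
    for token in tokens:
        if token != ' ':  # a bare space also separates the date parts
            text = text.replace(token, '')
    return text.strip()


def get_date_str(text):
    """Return (date, matched substring) using dateutil's fuzzy parser."""
    # Imported here so the helper above works without dateutil installed.
    import dateutil.parser as dparser

    parsed, excluded = dparser.parse(text, fuzzy_with_tokens=True)
    return parsed.date(), strip_excluded_tokens(text, excluded)


text = 'something sep 28 2017 something abcd'
tokens = ('something ', ' ', 'something abcd')  # tokens reported in Step 1

print(strip_excluded_tokens(text, tokens))                       # sep 28 2017
print(strip_excluded_tokens(text, tokens, longest_first=False))  # sep 28 2017 abcd
```

The second call reproduces the failure described in the note. One caveat: str.replace removes every occurrence of a token, so an ignored token that also occurs inside the date text itself would be removed from it too.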