I know there are lots of Q&As about extracting a datetime from a string, for example with dateutil.parser:
import dateutil.parser as dparser
dparser.parse('something sep 28 2017 something',fuzzy=True).date()
output: datetime.date(2017, 9, 28)
But my question is how to find out which part of the string produced this result. For example, I want a function that also returns 'sep 28 2017':
datetime, datetime_str = get_date_str('something sep 28 2017 something')
outputs: datetime.date(2017, 9, 28), 'sep 28 2017'
Any clue or direction I could search in?
Extending the discussion with @Paul and following the solution from @alecxe, I have come up with the following solution, which works on a number of test cases. I've also made the problem slightly more challenging:
Step 1: get excluded tokens
import dateutil.parser as dparser
ostr = 'something sep 28 2017 something abcd'
_, excl_str = dparser.parse(ostr,fuzzy_with_tokens=True)
gives outputs of:
excl_str: ('something ', ' ', 'something abcd')
Step 2 : rank tokens by length
excl_str = list(excl_str)
excl_str.sort(reverse=True,key = len)
gives a sorted token list:
excl_str: ['something abcd', 'something ', ' ']
Step 3: delete tokens and ignore space element
for i in excl_str:
    if i != ' ':  # keep single spaces so the date parts stay separated
        ostr = ostr.replace(i, '')
# ostr now holds what is left of the original string
gives a final output
ostr: 'sep 28 2017 '
Note: step 2 is required because deletion goes wrong if a shorter token is a substring of a longer one. In this case, if deletion follows the order ('something ', ' ', 'something abcd'), the replacement first removes 'something ' from 'something abcd', so the token 'something abcd' no longer matches, 'abcd' never gets deleted, and the result is 'sep 28 2017 abcd'.