I have some texts that generally start with:
“12 minutes ago - There was a meeting...”
“2 hours ago - Apologies for being...”
“1 day ago - It is a sunny day in London...”
and so on. Basically I have information on:
Minutes
Hours
Day (starting from today)
I would like to turn this information into time series data by extracting this part and creating a new column (Datetime) from it. My dataset already has one column (Date) holding the date on which the search was performed (for example, today) in the format 26/05/2020, along with the time the search was submitted (e.g. 8:41am). So if the text starts with “12 minutes ago”, I should get:
26/05/2020 - 8:29 (datetime format in Python)
And for others:
26/05/2020 - 6:41
25/05/2020 - 8:41
The important thing is to have something (string, numeric, date format) that I can plot as a time series (I would like to see how many texts were posted per time interval). Any idea how I could do this?
If the format stays simple (<digits> <unit> ago ...), it's pretty easy to parse with the regex "^(\d+) (\w+) ago".
Then, once you have the value and the unit, e.g. ('12', 'minutes'), you pass them to timedelta, which accepts each unit as a keyword argument, e.g. timedelta(minutes=12). You do that by unpacking a mapping, timedelta(**{unit: value}), with the value converted to int.
import re
from datetime import datetime, timedelta

def parse(content):
    # Capture the leading "<digits> <unit> ago" prefix
    timeparts = re.search(r"^(\d+) (\w+) ago", content)
    if not timeparts:
        return None  # no relative-time prefix found
    unit = timeparts.group(2).rstrip('s') + 's'  # ensure ends with 's'
    # return datetime.now() - timedelta(**{unit: int(timeparts.group(1))})  # Now date
    return datetime(2020, 5, 26, 8, 0, 0) - timedelta(**{unit: int(timeparts.group(1))})  # Fixed date
Demo
values = [
    "12 minutes ago - There was a meeting...",
    "2 hours ago - Apologies for being...",
    "1 day ago - It is a sunny day in London...",
]
for value in values:
    res = parse(value)
    print(res)
2020-05-26 07:48:00
2020-05-26 06:00:00
2020-05-25 08:00:00
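To use the actual search date and time from the question instead of a hard-coded reference, the same idea can be sketched as below. This is only an illustration under some assumptions: the function name to_datetime, its parameters, and the per-hour grouping are made up for the example, the date is assumed to be in the 26/05/2020 format and the time in the 8:41am format, and parsing "am"/"pm" with %p assumes an English or C locale. Units that timedelta does not accept (for example "months") would raise a TypeError.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

def to_datetime(text, search_date, search_time):
    """Return the posting datetime implied by a '<digits> <unit> ago' prefix, or None."""
    match = re.search(r"^(\d+) (\w+) ago", text)
    if not match:
        return None
    # Reference moment: when the search was run, e.g. "26/05/2020" and "8:41am"
    reference = datetime.strptime(f"{search_date} {search_time}", "%d/%m/%Y %I:%M%p")
    unit = match.group(2).rstrip("s") + "s"  # timedelta keywords are plural
    return reference - timedelta(**{unit: int(match.group(1))})

texts = [
    "12 minutes ago - There was a meeting...",
    "2 hours ago - Apologies for being...",
    "1 day ago - It is a sunny day in London...",
]
stamps = [to_datetime(t, "26/05/2020", "8:41am") for t in texts]
for stamp in stamps:
    print(stamp.strftime("%d/%m/%Y - %H:%M"))
# 26/05/2020 - 08:29
# 26/05/2020 - 06:41
# 25/05/2020 - 08:41

# Count texts per hour by truncating each timestamp to the start of its hour
per_hour = Counter(
    s.replace(minute=0, second=0, microsecond=0) for s in stamps if s is not None
)
```

The resulting datetime objects can be stored in a new Datetime column (for instance by applying the function row by row), and the per_hour counts give one possible basis for a time series plot.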