Tags: python, pandas, string, text-mining

Extract date/time information from a string


I have some texts that generally start with:

“12 minutes ago - There was a meeting...”
“2 hours ago - Apologies for being...”
“1 day ago - It is a sunny day in London...”

and so on. Basically I have information on:

Minutes
Hours
Days (counting back from today)

I would like to turn this information into usable time series data: extract this part and create a new column (Datetime) from it. My dataset already has a column (Date) with the date on which the search was performed (for example, today) in the format 26/05/2020, along with the time at which the search was submitted (e.g. 8:41am). So if the text starts with “12 minutes ago”, I should get:

26/05/2020 - 8:29 (datetime format in Python)

And for others:

26/05/2020 - 6:41
25/05/2020 - 8:41

The important thing is to have something (string, numeric, date format) that I can plot as a time series (I would like to see how many texts were posted per time interval). Any idea how I could do this?


Solution

  • If the format stays simple (<digits> <unit> ago ...), it's pretty easy to parse with "^(\d+) (\w+) ago".

    Then, once you have ('12', 'minutes'), pass them to timedelta, which accepts units such as minutes, hours and days as keyword arguments, e.g. timedelta(minutes=12). You can do that by unpacking a mapping: **{unit: value}

    import re
    from datetime import datetime, timedelta

    def parse(content):
        # Match a leading "<digits> <unit> ago", e.g. "12 minutes ago"
        timeparts = re.search(r"^(\d+) (\w+) ago", content)
        if not timeparts:
            return None
        unit = timeparts.group(2).rstrip('s') + 's'  # ensure ends with 's'
        offset = timedelta(**{unit: int(timeparts.group(1))})
        # return datetime.now() - offset                # Now date
        return datetime(2020, 5, 26, 8, 0, 0) - offset  # Fixed date
    

    Demo

    values = [
        "12 minutes ago - There was a meeting...",
        "2 hours ago - Apologies for being...",
        "1 day ago - It is a sunny day in London...",
    ]

    for value in values:
        res = parse(value)
        print(res)
    
    
    2020-05-26 07:48:00
    2020-05-26 06:00:00
    2020-05-25 08:00:00
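
The demo above subtracts from a fixed 08:00 timestamp. To use each row's own search date and time instead, as described in the question, something along these lines might work. This is only a sketch: the function name posted_at, the sample rows, and the assumed input formats ("26/05/2020" and "8:41am") are illustrative and not taken from the actual dataset. It also counts the results per hour, which is one possible way to get numbers to plot.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

def posted_at(date_str, time_str, text):
    """Return the search timestamp minus the '<digits> <unit> ago' prefix of text."""
    # Assumed formats: "26/05/2020" for the date and "8:41am" for the time.
    searched = datetime.strptime(f"{date_str} {time_str}", "%d/%m/%Y %I:%M%p")
    match = re.search(r"^(\d+) (\w+) ago", text)
    if not match:
        return None  # no recognisable "<digits> <unit> ago" prefix
    unit = match.group(2).rstrip("s") + "s"  # "day" -> "days", "hours" -> "hours"
    return searched - timedelta(**{unit: int(match.group(1))})

# Hypothetical rows: (Date, search time, text)
rows = [
    ("26/05/2020", "8:41am", "12 minutes ago - There was a meeting..."),
    ("26/05/2020", "8:41am", "2 hours ago - Apologies for being..."),
    ("26/05/2020", "8:41am", "1 day ago - It is a sunny day in London..."),
]

stamps = [posted_at(*row) for row in rows]
for stamp in stamps:
    print(stamp)

# Count texts per hour by truncating each timestamp to the start of its hour.
per_hour = Counter(
    s.replace(minute=0, second=0, microsecond=0) for s in stamps if s is not None
)
```

With a pandas DataFrame, a function like this could be applied row by row (for example through DataFrame.apply with axis=1) to fill the new Datetime column. Units that timedelta does not accept as keywords, such as months, would raise a TypeError here and would need separate handling.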