Search code examples
rdatetimetextsplitcomments

Split Comment at Each TimeStamp


Hey, I have a comment with various time stamps in a single cell as follows:-

2019-07-26 20:36:19 - (Work notes) Informed the caller that the transactions are removed from Concur. Resolved the INC as no action pending. Send resolution email to caller, copy paste the response and simplify/summarize the information received from the Eng team YesUpdate the work notes YesUpdate the state to Awaiting User Yes

2019-07-26 10:32:05 - oneflow (Work notes)[code] Hi Team.We have removed those gits.

What i want is to split this cell into rows so that each time stamp is split with its respective text.

Please help. Any code in R or Python will help.


Solution

  • Python option using regex:

    import re
    
    s = """2019-07-26 20:36:19 - (Work notes) Informed the caller that the [...]
    line without timestamp!
    2019-07-26 10:32:05 - oneflow (Work notes)[code] Hi Team.We have removed those gits."""
    
    # search for the timestamps
    timestamps = re.findall(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', s)
    
    # if timestamps were found, obtain their indices in the string:
    if timestamps:
        idx = [s.index(t) for t in timestamps] + [None] # add None to get the last part...
    
        # split the string and put the results in tuples:
        text_tuples = []
        l = len(timestamps[0]) # how many characters to expect for the timestamp
        for i, j in zip(idx[:-1], idx[1:]): # use zip to iterate over two sequences at once
            text_tuples.append((s[i:i+l], # timestamp
                                s[i+l:j].strip(' - '))) # part before next timestamp
    
    # text_tuples
    # [('2019-07-26 20:36:19',
    #   '(Work notes) Informed the caller that the [...]\nline without timestamp!\n'),
    #  ('2019-07-26 10:32:05',
    #   'oneflow (Work notes)[code] Hi Team.We have removed those gits.')]
    

    In this example you will get a list of tuples containing the timestamp and the corresponding rest of the line. If a line has no timestamp, it will not go into the output.


    edit: extension to a pandas DataFrame after OP's comment:

    import re
    import pandas as pd
    
    # create a custom function to split the comments:
    def split_comment(s):
        # search for the timestamps
        timestamps = re.findall(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', s)
    
        # if timestamps were found, obtain their indices in the string:
        if timestamps:
            idx = [s.index(t) for t in timestamps] + [None] # add None to get the last part...
            # split the string and put the results in tuples:
            splitted = []
            l = len(timestamps[0]) # how many characters to expect for the timestamp
            for i, j in zip(idx[:-1], idx[1:]): # use zip to iterate over two sequences at once
                splitted.append([s[i:i+l], # timestamp
                                 s[i+l:j].strip(' - ')]) # part before next timestamp
            return splitted
        return ['NaT', s] # no timestamp found, return s
    
    s0 = """2019-07-26 20:36:19 - (Work notes) Informed the caller that the [...]
    line without timestamp!
    2019-07-26 10:32:05 - oneflow (Work notes)[code] Hi Team.We have removed those gits."""
    s1 = "2019-07-26 20:36:23  another comment"
    
    # create example df
    df = pd.DataFrame({'s': [s0, s1], 'id': [0, 1]})
    
    # create a dummy column that holds the resulting series we get if we apply the function:
    df['tmp'] = df['s'].apply(split_comment)
    
    # explode the df so we have one row for each timestamp / comment pair:
    df = df.explode('tmp').reset_index(drop=True)
    
    # create two columns from the dummy column, 'timestamp' and 'comment':
    df[['timestamp', 'comment']] = pd.DataFrame(df['tmp'].to_list(), index=df.index)
    
    # drop stuff we dont need anymore:
    df = df.drop(['s', 'tmp'], axis=1)
    
    # so now we have:
    # df
    #    id            timestamp                                            comment
    # 0   0  2019-07-26 20:36:19  (Work notes) Informed the caller that the [......
    # 1   0  2019-07-26 10:32:05  oneflow (Work notes)[code] Hi Team.We have rem...
    # 2   1  2019-07-26 20:36:23                                    another comment