python, regex, list, text

How to combine a string with the previous one based on a condition?


I have a text file of messages, which I read into a list of lines:

with open(code_path) as f:
    contents = f.readlines()
print(contents)

['22/05/2022, 21.58 - Name: message1 \n', 
'22/05/2022, 22.07 – Name2: message2\n', 
'message2 continues\n', 
'22/05/2022, 22.09 – Name: message3\n']

Currently each line of the file is a separate string, so some long messages are split across two strings. I would like a list in which every message is a single string, where each message starts with a date.

This is what I want:

['22/05/2022, 21.58 - Name: message1 \n', 
'22/05/2022, 22.07 – Name2: message2 + message2 continues\n', 
'22/05/2022, 22.09 – Name: message3\n']

Is there a way to do this?

I have found the strings starting with a date with:

import re

dates = [re.findall("^[0-3][0-9]/[0-3][0-9]/20[1-2][1-9]", i) for i in contents]

But I don't know how to continue.


Solution

  • A basic approach is to keep a running list of messages (a kind of cache) and go through the lines:

    • if the line starts with a date, append a new item to the list
    • if it doesn't, append the line to the most recent item.
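    import re

    messages = []
    for line in contents:
        # A line starting with "dd/mm/yyyy, " begins a new message.
        if re.match(r'\d{2}/\d{2}/\d{4},\s+', line):
            messages.append([line])
        else:
            # Continuation line: attach it to the most recent message.
            # (Assumes the first line of the file starts with a date.)
            messages[-1].append(line)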
     
    # messages
    [['22/05/2022, 21.58 - Name: message1 \n'],
     ['22/05/2022, 22.07 – Name2: message2\n', 'message2 continues\n'],
     ['22/05/2022, 22.09 – Name: message3\n']]
    

    You can then join each group into a single string, e.g. [''.join(m) for m in messages]; note that this keeps the line break inside a multi-line message. It is also possible to build the strings directly, but the list of lists is more useful if you later want to distinguish between the first line of a message and its continuation lines.
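If you only need the flat list of strings from the question, building the strings directly could look roughly like the sketch below. The names `merge_continuations`, `DATE_PREFIX` and `sep` are made up for this illustration, and the date pattern assumes every message starts with "dd/mm/yyyy, " as in the sample data. The inner line break is replaced by `sep` (a space by default), and a leading line without a date is kept as its own item instead of raising an IndexError.

```python
import re

# Assumption: a new message always begins with "dd/mm/yyyy, " as in the sample data.
DATE_PREFIX = re.compile(r"\d{2}/\d{2}/\d{4},\s+")


def merge_continuations(lines, sep=" "):
    """Return one string per message, folding continuation lines into the previous one."""
    merged = []
    for line in lines:
        if DATE_PREFIX.match(line) or not merged:
            # A dated line (or a leading line without a date) starts a new item.
            merged.append(line)
        else:
            # Drop the previous line break and attach the continuation after `sep`.
            merged[-1] = merged[-1].rstrip("\n") + sep + line
    return merged


contents = [
    "22/05/2022, 21.58 - Name: message1 \n",
    "22/05/2022, 22.07 – Name2: message2\n",
    "message2 continues\n",
    "22/05/2022, 22.09 – Name: message3\n",
]

result = merge_continuations(contents)
for message in result:
    print(repr(message))
```

Passing sep=" + " would reproduce the "message2 + message2 continues" form shown in the question, if that separator was meant literally.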