How to use regex to include linebreaks in extracted results


I am processing a text file of messages that resembles this (though a lot longer):

13/09/18, 4:14 pm - Fred Dag: Jackie, please could you send to me too? ‚ thank you
Hello
13/09/18, 4:45 pm - Jackie Johnson: Here is yet another message
where someone added a line break
13/09/18, 4:10 pm - Fred Dag: Here is another message

The following regex works to extract the data into Date, Time, Name and Message except where the Message includes a line break:

(?<date>(?:[0-9]{1,2}\/){2}[0-9]{1,2}),\s(?<time>(?:[0-9]{1,2}:)[0-9]{2}\s[a|p]m)\s-\s(?<name>(?:.*)):\s(?<message>(?:.+))

Using preg_match_all with the regex above in PHP 7.4, I generated the following array:

Array
(
    [0] => Array
        (
            [date] => 13/09/18
            [time] => 4:14 pm
            [name] => Fred Dag
            [message] => Jackie, please could you send to me too? ‚ thank you
        )

    [1] => Array
        (
            [date] => 13/09/18
            [time] => 4:45 pm
            [name] => Jackie Johnson
            [message] => Here is yet another message
        )

    [2] => Array
        (
            [date] => 13/09/18
            [time] => 4:10 pm
            [name] => Fred Dag
            [message] => Here is another message
        )

)

But the array is missing the continuation lines that follow a line break, which should be appended to the preceding Message. I get the same result on regex101.com.

  • I tried including the single line modifier for the message like this (?<message>(?s:.+)) but that then selected everything from the start of the first message to the end of the file.
  • I tried playing with greedy vs non-greedy but I couldn't get that to work.
  • I tried using a reverse lookup, but I don't seem to have enough understanding to get that to work and ended up just randomly pasting code off the internet which did nothing but get me frustrated.

I think I have exhausted my knowledge of regex and reached the end of Google with the terms I know to search with :) Could anyone point me in the right direction?


Solution

  • Your immediate problem is that the dot you use to match the message content does not match newlines. That is easily fixed with the /s (dot-all) flag in your PHP regex. On its own, though, that flag lets a greedy .+ run all the way to the end of the input, so the regex also needs to change: make the quantifier lazy and stop it with a lookahead for the next message header. I suggest the following pattern:

    \d{2}\/\d{2}\/\d{2}, \d{1,2}:\d{1,2}.*?(?=\d{2}\/\d{2}\/\d{2}, \d{1,2}:\d{1,2}|$)
    

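    This pattern matches one message starting at its date, continues across newlines, and stops at either the start of the next message or the end of the input. The lazy .*? ends at the first position where the lookahead (?=...) succeeds, so the next message's date is not consumed.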

    Sample script:

    $input = "13/09/18, 4:14 pm - Fred Dag: Jackie, please could you send to me too? ‚ thank you\nHello\n13/09/18, 4:45 pm - Jackie Johnson: Here is yet another message\nwhere someone added a line break\n13/09/18, 4:10 pm - Fred Dag: Here is another message";
    // Single quotes keep the backslashes and $ literal; /s lets . match newlines.
    preg_match_all('/\d{2}\/\d{2}\/\d{2}, \d{1,2}:\d{1,2}.*?(?=\d{2}\/\d{2}\/\d{2}, \d{1,2}:\d{1,2}|$)/s', $input, $matches);
    print_r($matches[0]);
    

    This prints:

    Array
    (
        [0] => 13/09/18, 4:14 pm - Fred Dag: Jackie, please could you send to me too? ‚ thank you
        Hello
    
        [1] => 13/09/18, 4:45 pm - Jackie Johnson: Here is yet another message
        where someone added a line break
    
        [2] => 13/09/18, 4:10 pm - Fred Dag: Here is another message
    )
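
If you also want the Date, Time, Name and Message fields from the question, the same lookahead idea can be combined with named groups. The following is only a sketch under some assumptions: the function name parseChatLog is hypothetical, the name group is narrowed to [^:\n]+ (which assumes sender names never contain a colon), and the sample input is a simplified version of the one in the question. The /m flag anchors ^ at each line start, and /s lets the lazy .*? cross newlines until the next header or the end of the text.

```php
<?php
// Hypothetical helper: splits a chat export into date/time/name/message records.
function parseChatLog(string $text): array
{
    // Header of a message line, e.g. "13/09/18, 4:14 pm - ".
    $header = '\d{1,2}/\d{1,2}/\d{2},\s\d{1,2}:\d{2}\s[ap]m\s-\s';

    // "~" is used as the delimiter so the slashes in the date need no escaping.
    $pattern = '~^(?<date>\d{1,2}/\d{1,2}/\d{2}),\s'
             . '(?<time>\d{1,2}:\d{2}\s[ap]m)\s-\s'
             . '(?<name>[^:\n]+):\s'                // assumption: no colon in names
             . '(?<message>.*?)'                    // lazy, may span several lines
             . '(?=\R' . $header . '|\s*\z)~ms';    // stop before next header or at end

    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    // Keep only the named groups so each entry mirrors the array in the question.
    $records = [];
    foreach ($matches as $m) {
        $records[] = [
            'date'    => $m['date'],
            'time'    => $m['time'],
            'name'    => $m['name'],
            'message' => $m['message'],
        ];
    }
    return $records;
}

// Simplified sample input modelled on the question.
$input = "13/09/18, 4:14 pm - Fred Dag: Jackie, please could you send to me too? thank you\nHello\n"
       . "13/09/18, 4:45 pm - Jackie Johnson: Here is yet another message\nwhere someone added a line break\n"
       . "13/09/18, 4:10 pm - Fred Dag: Here is another message";

print_r(parseChatLog($input));
```

With this input the first record's message should hold both lines ("...thank you" followed by "Hello"), since the match only stops once a newline is followed by another date header. If a message body can itself start a line with something that looks like a header, this approach would split it, so treat the header pattern as something to adapt to your real data.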