Tags: regex, parsing, multiline

Get all the characters until a new date/hour is found


I have to parse a lot of content with a regular expression. The content might, for example, be:

14-08-2015 14:18 : Example : Hello =) How are you?
What are you doing?
14-08-2015 14:19: Example2 : I'm fine thanks!

I have this regular expression, which returns 2 matches for the text above, with the groups that I need (date, time, name, and multi-line message):

(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):([^\d]+)

The problem is that the last group, [^\d]+, cannot consume digits, so the match is cut short as soon as a number appears inside the message. For example, it does not work in this case:

14-08-2015 14:18 : Example : Hello =) How are you?
What are you 2 doing?
14-08-2015 14:19: Example2 : I'm fine thanks!
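
To make the failure concrete, here is a minimal sketch that runs the expression above against this second sample. Python is an assumption made for illustration (the question does not name a language), and the variable names are hypothetical.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):([^\d]+)")

text = (
    "14-08-2015 14:18 : Example : Hello =) How are you?\n"
    "What are you 2 doing?\n"
    "14-08-2015 14:19: Example2 : I'm fine thanks!\n"
)

# Group 4 is the message; [^\d]+ stops in front of the first digit it meets.
messages = [m.group(4) for m in ORIGINAL.finditer(text)]

for message in messages:
    print(repr(message))
```

The first captured message ends just before the "2", so " doing?" is lost.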

How do I get all the characters until a new date/hour is found?


Solution

  • Use a lookahead for dates and get everything up to that.

    /^(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):\s?((?:(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}).)*)/sm
    

    I've edited your regex in two ways:

    1. Added ^ to the front, so that a match can only start at a timestamp at the beginning of a line. This should filter out most problems with people posting timestamps inside a message.

    2. Replaced the last capturing group with ((?:(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}).)*)

      • (?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}) is a negative lookahead for a date and time at the start of a line.
      • (?:(lookahead).)* matches any number of characters, checking before each one that a line-anchored timestamp does not start at that position.
      • ((?:(lookahead).)*) just captures the result as a group for you.

    It's not that efficient, because the lookahead is evaluated before every character of the message, but it works. Note the s flag (dotall: the dot also matches newlines) and the m flag (multiline: ^ matches at the start of every line). The ^ inside the lookahead is necessary so that the match does not stop when someone posts a timestamp in the middle of a line; the ^ at the front makes sure you only match timestamps that begin a line.

    DEMO: https://regex101.com/r/rX8eH0/3
    DEMO with flags in regex: https://regex101.com/r/rX8eH0/4
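
For readers who want to try the lookahead pattern from the answer outside regex101, here is a minimal sketch in Python. The choice of Python is an assumption, as the original post does not name a language; `parse_log`, `TIMESTAMPED_MESSAGE`, and `sample` are hypothetical names, and stripping whitespace from the name and message is a choice made for this sketch only. The `re.S` and `re.M` flags play the role of the s and m flags.

```python
import re

# The answer's pattern, written for Python's re module. re.S corresponds to
# the "s" flag (dot matches newlines) and re.M to the "m" flag (^ matches at
# the start of every line).
TIMESTAMPED_MESSAGE = re.compile(
    r"^(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):\s?"
    r"((?:(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}).)*)",
    re.S | re.M,
)


def parse_log(text):
    """Return a list of (date, time, name, message) tuples (hypothetical helper)."""
    entries = []
    for match in TIMESTAMPED_MESSAGE.finditer(text):
        date, time, name, message = match.groups()
        # Stripping surrounding whitespace is a choice made for this sketch.
        entries.append((date, time, name.strip(), message.strip()))
    return entries


sample = (
    "14-08-2015 14:18 : Example : Hello =) How are you?\n"
    "What are you 2 doing?\n"
    "14-08-2015 14:19: Example2 : I'm fine thanks!\n"
)

for entry in parse_log(sample):
    print(entry)
```

With the sample from the question, the first message keeps its second line, including the "2", and a timestamp that appears in the middle of a line does not start a new entry.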