javascript regex multiline capturing-group

JavaScript regex: capture multiline content from each newline-separated entry


I am using a JavaScript regex to process some raw data and transform it into a 2D array.

Task Briefing (JS only):

Transform raw string data into a 2D array.

Raw Data Input :

Here is a sample with 4 entries. Each new entry starts on a new line. Entry 3 has multiline content.

2012/12/1, AM12:21 - user1‬: entry1_wasehhjdsaj

2012/12/2, AM9:42 - user2‬: entry2_bahbahbah_dsdeead

2012/12/2, AM9:44 - user3‬: entry3_Line1_ContdWithFollowingLine_bahbahbah

entry3_Line2_ContdWithABoveLine_bahbahbah_erererw

entry3_Line3_ContdWithABoveLine_bahbahbah_dsff

2012/12/4, AM11:48 - user7‬: entry4_bahbahbah_fggf

(The raw string data contains no empty lines.) Update: sorry for the confusion. The entry contents do not necessarily end with the same END pattern, only with a line break.

How does the pattern actually end? (Thanks to @Tim Pietzcker's comment.) An entry's content ends with a line break followed by the start of the next entry's timestamp. (You can assume the entry contents do not contain any similar timestamp pattern.)

I understand this may be a troublesome regex question, so ANY OTHER JS METHOD THAT ACHIEVES THE SAME GOAL WILL ALSO BE ACCEPTED.

My current regex with capture group:

/^([0-9]{4}|[0-9]{2})[\/]([0]?[1-9]|[1][0-2])[\/]([0]?[1-9]|[1|2][0-9]|[3][0|1]), ([A|P])M([1-9]|1[0-2]):([0-5]\d) - (.*?): (.*)/gm

Desired capture result:

MATCH 1

  1. 2012
  2. 12
  3. 1
  4. A
  5. 12
  6. 21
  7. user1‬
  8. entry1_wasehhjdsaj

MATCH 2

  1. 2012
  2. 12
  3. 2
  4. A
  5. 9
  6. 42
  7. user2‬
  8. entry2_bahbahbah_dsdeead

MATCH 3

  1. 2012
  2. 12
  3. 2
  4. A
  5. 9
  6. 44
  7. user3‬
  8. entry3_Line1_ContdWithFollowingLine_bahbahbah entry3_Line2_ContdWithABoveLine_bahbahbah_erererw entry3_Line3_ContdWithABoveLine_bahbahbah_dsff

MATCH 4

(to be skipped...)


Problem:

The problem is with Entry 3: I can't capture its 2nd and 3rd lines of content. If an entry contains only ONE line of content, the regex works fine.

How can I capture Entry 3 with its multiline content? I tried the m modifier, but I have no idea how to handle multiline content and newline-separated entries at the same time.

If this is impossible to achieve with a JS regex, please suggest another JS approach to transform the raw data into a 2D array, which is the ultimate goal.

THANKS!


Testing: https://regex101.com/r/eS9pY5/1


Solution

  • Here is a single regex that will match the strings you have the way you need:

    ^(\d{4}|\d{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - (.*?): ((?:(?!(?:\d{4}|\d{2})\/(?:0?[1-9]|1[0-2])\/(?:0?[1-9]|[12]\d|3[01]))[\s\S])*)(?=\n|$)
    

    See demo

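    The last capturing group is no longer a greedy dot pattern .* but a tempered greedy token (?:(?!(?:\d{4}|\d{2})\/(?:0?[1-9]|1[0-2])\/(?:0?[1-9]|[12]\d|3[01]))[\s\S])* that matches any character, line breaks included, up to the end of the string or the next date pattern. With the m flag, the trailing (?=\n|$) makes the match stop before the line break that precedes the next entry.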

    If we unroll it to make it more efficient:

    ^(\d{4}|\d{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - (.*?): (\D*(?:\d(?!(?:\d{3}|\d)\/(?:0?[1-9]|1[0-2])\/(?:0?[1-9]|[12]\d|3[01]))\D*)*)(?=\n|$)
    

    See another demo
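
As a rough sketch of how the first pattern might be used to build the 2D array asked for in the question: the code below applies it with the g and m flags and collects the eight capture groups of each match into one row. It assumes an engine that supports String.prototype.matchAll (ES2020 or later). The helper name parseEntries and the plain-ASCII sample string are made up for illustration. Joining continuation lines with a single space follows the desired result shown in the question and is an assumption about what you want.

```javascript
// Pattern from the answer above, used with the g and m flags.
const entryRe = /^(\d{4}|\d{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - (.*?): ((?:(?!(?:\d{4}|\d{2})\/(?:0?[1-9]|1[0-2])\/(?:0?[1-9]|[12]\d|3[01]))[\s\S])*)(?=\n|$)/gm;

// Hypothetical helper: returns one row (array of 8 strings) per entry.
function parseEntries(raw) {
  const rows = [];
  for (const match of raw.matchAll(entryRe)) {
    // match[0] is the whole entry; groups 1..8 are the fields we want.
    const fields = match.slice(1);
    // Group 8 may span several lines; join them with a space
    // (assumption based on the desired result in the question).
    fields[7] = fields[7].replace(/\s*\n\s*/g, ' ').trim();
    rows.push(fields);
  }
  return rows;
}

// Sample input modelled on the question (plain ASCII, no empty lines).
const sample = [
  '2012/12/1, AM12:21 - user1: entry1_wasehhjdsaj',
  '2012/12/2, AM9:42 - user2: entry2_bahbahbah_dsdeead',
  '2012/12/2, AM9:44 - user3: entry3_Line1_ContdWithFollowingLine_bahbahbah',
  'entry3_Line2_ContdWithABoveLine_bahbahbah_erererw',
  'entry3_Line3_ContdWithABoveLine_bahbahbah_dsff',
  '2012/12/4, AM11:48 - user7: entry4_bahbahbah_fggf'
].join('\n');

const rows = parseEntries(sample);
console.log(rows.length); // number of entries found
console.log(rows[2][7]);  // content of entry 3, joined onto one line
```

On engines without matchAll, the same rows could be collected with a `while ((match = entryRe.exec(raw)) !== null)` loop instead.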