I am using a JavaScript regex to process raw string data and transform it into a 2D array.
Transforming raw string data to a 2D array.
Here is a sample with 4 entries; each new entry starts on a new line. Entry 3 has multi-line content.
2012/12/1, AM12:21 - user1: entry1_wasehhjdsaj
2012/12/2, AM9:42 - user2: entry2_bahbahbah_dsdeead
2012/12/2, AM9:44 - user3: entry3_Line1_ContdWithFollowingLine_bahbahbah
entry3_Line2_ContdWithABoveLine_bahbahbah_erererw
entry3_Line3_ContdWithABoveLine_bahbahbah_dsff
2012/12/4, AM11:48 - user7: entry4_bahbahbah_fggf
(Raw string data, with no empty lines.) Update: Sorry for the confusion; the end of an entry's content does not necessarily follow a fixed END pattern, it is just a line break.
How does the pattern actually end? (Thanks to @Tim Pietzcker's comment.) The content ends with a line break, followed by the start of the next entry's timestamp. (You can assume the entry contents do not contain any similar timestamp pattern.)
I understand this may be a troublesome regex question, so any other JS method that achieves the same goal will also be accepted.
/^([0-9]{4}|[0-9]{2})[\/]([0]?[1-9]|[1][0-2])[\/]([0]?[1-9]|[1|2][0-9]|[3][0|1]), ([A|P])M([1-9]|1[0-2]):([0-5]\d) - (.*?): (.*)/gm
MATCH 1
2012
12
1
A
12
21
user1
entry1_wasehhjdsaj
MATCH 2
2012
12
2
A
9
42
user2
entry2_bahbahbah_dsdeead
MATCH 3
2012
12
2
A
9
44
user3
entry3_Line1_ContdWithFollowingLine_bahbahbah entry3_Line2_ContdWithABoveLine_bahbahbah_erererw entry3_Line3_ContdWithABoveLine_bahbahbah_dsff
MATCH 4
(to be skipped...)
The problem is with Entry 3: I can't capture its 2nd and 3rd lines of content. If an entry contains only ONE line of content, the regex works fine.
How can I capture Entry 3 with its multi-line content? I tried the m modifier, but I have no idea how to handle multi-line content and entries that start on a new line at the same time.
If this is impossible with a JS regex, please suggest another JS approach to transform the raw data into a 2D array, which is the ultimate goal.
THANKS!
the end of an entry's content does not necessarily follow a fixed END pattern, it is just a line break.
Testing: https://regex101.com/r/eS9pY5/1
Here is a single regex that will match the strings you have the way you need:
^(\d{4}|\d{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - (.*?): ((?:(?!(?:\d{4}|\d{2})\/(?:0?[1-9]|1[0-2])\/(?:0?[1-9]|[12]\d|3[01]))[\s\S])*)(?=\n|$)
See demo
The last capturing group is no longer a greedy dot pattern, .*,
but a tempered greedy token, (?:(?!(?:\d{4}|\d{2})\/(?:0?[1-9]|1[0-2])\/(?:0?[1-9]|[12]\d|3[01]))[\s\S])*,
which matches everything up to the end of the string or the next date pattern.
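To show how this could produce the 2D array the question asks for, here is a minimal sketch that applies the regex above with the `g` and `m` flags. The variable names (`raw`, `entryRe`, `rows`) are illustrative, and `String.prototype.matchAll` is assumed to be available (ES2020 or later). Note that the continuation lines of Entry 3 stay joined by `\n` in the last column.

```javascript
// Sample input from the question; entry 3 spans three lines.
const raw = [
  '2012/12/1, AM12:21 - user1: entry1_wasehhjdsaj',
  '2012/12/2, AM9:42 - user2: entry2_bahbahbah_dsdeead',
  '2012/12/2, AM9:44 - user3: entry3_Line1_ContdWithFollowingLine_bahbahbah',
  'entry3_Line2_ContdWithABoveLine_bahbahbah_erererw',
  'entry3_Line3_ContdWithABoveLine_bahbahbah_dsff',
  '2012/12/4, AM11:48 - user7: entry4_bahbahbah_fggf',
].join('\n');

// The tempered-greedy-token regex from above, with the g and m flags.
const entryRe = /^(\d{4}|\d{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - (.*?): ((?:(?!(?:\d{4}|\d{2})\/(?:0?[1-9]|1[0-2])\/(?:0?[1-9]|[12]\d|3[01]))[\s\S])*)(?=\n|$)/gm;

// One row per entry; drop index 0 (the full match) and keep the 8 groups:
// [year, month, day, A|P, hour, minute, user, content]
const rows = Array.from(raw.matchAll(entryRe), (m) => m.slice(1));

console.log(rows.length); // 4
console.log(rows[2][7]);  // the three lines of entry 3, separated by "\n"
```

If a single-line content string is preferred, the last column could be post-processed, for example with `row[7].replace(/\n/g, ' ')`.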
If we unroll it to make it more efficient:
^(\d{4}|\d{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - (.*?): (\D*(?:\d(?!(?:\d{3}|\d)\/(?:0?[1-9]|1[0-2])\/(?:0?[1-9]|[12]\d|3[01]))\D*)*)(?=\n|$)
See another demo
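Since the question also welcomes non-regex-only approaches, here is a rough sketch of a line-by-line alternative. It is not part of the regexes above: `parseEntries` and `headerRe` are hypothetical names, and the sketch relies on the question's assumption that content lines never look like an entry header. Each line is tested against a single-line header regex; lines that do not match are appended to the content of the previous entry.

```javascript
// Hypothetical helper: matches one entry header line (no g/m flags needed).
const headerRe = /^(\d{4}|\d{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - (.*?): (.*)$/;

function parseEntries(text) {
  const rows = [];
  for (const line of text.split(/\r?\n/)) {
    const m = headerRe.exec(line);
    if (m) {
      // New entry: keep the 8 captured groups as one row.
      rows.push(m.slice(1));
    } else if (rows.length > 0) {
      // Continuation line: append to the content column of the last entry.
      rows[rows.length - 1][7] += '\n' + line;
    }
    // Lines before the first header are ignored in this sketch.
  }
  return rows;
}

const sample = [
  '2012/12/2, AM9:44 - user3: entry3_Line1_ContdWithFollowingLine_bahbahbah',
  'entry3_Line2_ContdWithABoveLine_bahbahbah_erererw',
  'entry3_Line3_ContdWithABoveLine_bahbahbah_dsff',
  '2012/12/4, AM11:48 - user7: entry4_bahbahbah_fggf',
].join('\n');

const parsed = parseEntries(sample);
console.log(parsed.length);                 // 2
console.log(parsed[0][7].split('\n').length); // 3
```

This trades the single large regex for a simple loop, which may be easier to debug when the input format changes.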