Grouping tags based on a specific pattern

A text snippet like

04040p0015 Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, TM15DI_DO, TM120, TM150,

has to be parsed into the four groups

04040
p0015
Macro drive object / Macro DO
SERVO, VECTOR, HLA, SERVO_AC, VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, TM15DI_DO, TM120, TM150,

The first and second groups represent different kinds of IDs, the 3rd group represents a title, and the 4th group contains tags. Each tag consists of 2 or more capital letters, followed by zero or one _ , followed by zero or more capital letters, followed by a comma.

The regex

([0-9]+)([rp][0-9]{4,})(.*)([A-Z]{2,}_?[A-Z,0-9]{2,},)

returns

04040
p0015
Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, TM15DI_DO, TM120,
TM150,

i.e., it gets the first two groups right, but fails to correctly separate the last two groups.
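
The result can be reproduced with a short sketch. Python's re module is an assumption here, since the regex engine in use is not stated, and the helper name original_groups is made up for illustration.

```python
import re

# The snippet from above, as a single line without trailing whitespace.
TEXT = (
    "04040p0015 Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, "
    "VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, "
    "TM15DI_DO, TM120, TM150,"
)

# The regex exactly as posted.
ORIGINAL = r"([0-9]+)([rp][0-9]{4,})(.*)([A-Z]{2,}_?[A-Z,0-9]{2,},)"


def original_groups(text):
    """Return the four captured groups, or None if nothing matches."""
    match = re.search(ORIGINAL, text)
    return match.groups() if match else None


groups = original_groups(TEXT)
for number, group in enumerate(groups, start=1):
    # repr() makes leading and trailing whitespace visible.
    print(number, repr(group))
```

The third group swallows every tag except the last one, and the fourth group holds only TM150, .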

What is wrong with the regex?

Solution

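The greedy (.*) in your pattern consumes as much text as it can and backtracks only far enough for the last group to match a single tag, so every tag except the last one ends up in the 3rd group. Making that group lazy and letting the 4th group repeat the tag pattern fixes this. You may use this regex to get the desired 4 capture groups: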

(\d+)([rp]\d{4,})(.*?)\s+((?:[A-Z]\w+,\s+)+)

RegEx Details:

(\d+): 1st group to capture 1+ digits
([rp]\d{4,}): 2nd group to match text starting with r or p followed by 4+ digits
(.*?): 3rd group to match and capture 0 or more of any characters (lazy)
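\s+: 1+ whitespace characters
((?:[A-Z]\w+,\s+)+): 4th group to match and capture 1+ tags, each starting with an uppercase letter, followed by 1+ word characters (letters, digits, or _), a comma, and 1+ whitespace characters. Note that the last tag is captured only if whitespace follows its comma.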
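
As a minimal sketch of how the pattern might be used, here it is with Python's re module (an assumption, since the question does not name a regex engine). The names parse and PATTERN_EOS are hypothetical. PATTERN_EOS is a variant that is not part of the regex above: it replaces the final \s+ with (?:\s+|$) so that the last tag is still captured when the text ends right after its comma.

```python
import re

# The snippet from the question, without trailing whitespace.
TEXT = (
    "04040p0015 Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, "
    "VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, "
    "TM15DI_DO, TM120, TM150,"
)

# The pattern from this answer, unchanged.
PATTERN = re.compile(r"(\d+)([rp]\d{4,})(.*?)\s+((?:[A-Z]\w+,\s+)+)")

# Hypothetical variant: (?:\s+|$) also accepts end of string after a comma.
PATTERN_EOS = re.compile(r"(\d+)([rp]\d{4,})(.*?)\s+((?:[A-Z]\w+,(?:\s+|$))+)")


def parse(text, pattern=PATTERN_EOS):
    """Split one line into (first_id, second_id, title, list_of_tags)."""
    match = pattern.search(text)
    if match is None:
        return None
    first_id, second_id, title, tags = match.groups()
    # Split the tag group on commas and drop the empty trailing piece.
    tag_list = [tag for tag in re.split(r",\s*", tags) if tag]
    return first_id, second_id, title.strip(), tag_list


result = parse(TEXT)
print(result)
```

With the unchanged pattern, parse(TEXT, PATTERN) stops at TM120 for this exact string because nothing follows the final comma. If the input always has a space or newline after the last tag, both patterns capture the same tags.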