A text snippet like
04040p0015 Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, TM15DI_DO, TM120, TM150,
has to be parsed into four groups, where the first and second groups are two different kinds of IDs, the third group is a title, and the fourth group contains the tags. Each tag consists of two or more capital letters, followed by zero or one _, followed by zero or more capital letters, followed by a comma.
([0-9]+)([rp][0-9]{4,})(.*)([A-Z]{2,}_?[A-Z,0-9]{2,},)
returns
i.e., it gets the first two groups right, but fails to correctly separate the last two groups.
What is wrong with the regex?
You may use this regex to get the desired four capture groups:
(\d+)([rp]\d{4,})(.*?)\s+((?:[A-Z]\w+,\s+)+)
RegEx Details:
(\d+): 1st group to capture 1+ digits
([rp]\d{4,}): 2nd group to match text starting with r or p followed by 4+ digits
(.*?): 3rd group to match and capture 0 or more of any characters (lazy)
\s+: 1+ whitespaces
((?:[A-Z]\w+,\s+)+): 4th group to match and capture one or more words, each starting with an uppercase letter and followed by a comma and 1+ whitespaces
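The difference between the two patterns can be checked with a short Python sketch. In the pattern from the question, the greedy (.*) consumes as much as it can and backtracks only far enough for the last group to match once, so the 4th group ends up holding just the final tag. The names LINE, GREEDY, TAGGED and parse_line below are made up for illustration. TAGGED is the suggested regex with one assumed change: \s* instead of \s+ after the comma, so that the last tag is still captured when the line ends directly after its comma, as the sample above does.

```python
import re

# Sample line from the question (no whitespace after the final comma).
LINE = (
    "04040p0015 Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, "
    "VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, "
    "TM15DI_DO, TM120, TM150,"
)

# Pattern from the question: the greedy (.*) swallows all tags but the last.
GREEDY = re.compile(r"([0-9]+)([rp][0-9]{4,})(.*)([A-Z]{2,}_?[A-Z,0-9]{2,},)")

# Suggested pattern, with the assumed tweak \s* (instead of \s+) after the comma.
TAGGED = re.compile(r"(\d+)([rp]\d{4,})(.*?)\s+((?:[A-Z]\w+,\s*)+)")


def parse_line(line):
    """Return (id1, id2, title, tags) or None if the line does not match."""
    m = TAGGED.match(line)
    if m is None:
        return None
    # Split the tag group on commas and drop the empty trailing piece.
    tags = [t.strip() for t in m.group(4).split(",") if t.strip()]
    return m.group(1), m.group(2), m.group(3).strip(), tags


print(GREEDY.match(LINE).group(4))  # only the last tag
print(parse_line(LINE))
```

With the unmodified \s+ version, the same sample still matches, but the 4th group stops at "TM120, " because nothing follows the comma after TM150. If every input line is known to end in whitespace or a newline, the original \s+ works as written.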