regex regex-group

Grouping tags based on a specific pattern


A text snippet like

04040p0015 Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, TM15DI_DO, TM120, TM150,

has to be parsed into the four groups

  1. 04040
  2. p0015
  3. Macro drive object / Macro DO
  4. SERVO, VECTOR, HLA, SERVO_AC, VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, TM15DI_DO, TM120, TM150,

The first and second groups represent different kinds of IDs, the third group is a title, and the fourth group contains tags. Each tag consists of two or more capital letters, followed by zero or one _, followed by zero or more capital letters, followed by a comma.
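Read literally, that tag description corresponds to a pattern along the lines of [A-Z]{2,}_?[A-Z]*, (this translation is an interpretation, not a regex from the question). A small Python sketch, assuming Python's re module, checks which of the sample tags fit it. Several tags contain digits, a second underscore, or only one leading letter, which is worth keeping in mind when choosing the tag pattern.

```python
import re

# Literal translation of the prose description (an interpretation, not the asker's regex):
# 2+ capitals, optional underscore, 0+ capitals, then a comma.
literal_tag = re.compile(r"[A-Z]{2,}_?[A-Z]*,")

sample_tags = [
    "SERVO,", "VECTOR,", "HLA,", "SERVO_AC,", "VECTOR_AC,",
    "SERVO_I_AC,", "VECTOR_I_AC,", "A_INF,", "S_INF,", "R_INF,",
    "B_INF,", "TM31,", "TM15DI_DO,", "TM120,", "TM150,",
]

# fullmatch requires the whole tag (including its comma) to fit the pattern.
fits = [t for t in sample_tags if literal_tag.fullmatch(t)]
misfits = [t for t in sample_tags if not literal_tag.fullmatch(t)]

print("fits:   ", fits)
print("misfits:", misfits)
```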

The regex

([0-9]+)([rp][0-9]{4,})(.*)([A-Z]{2,}_?[A-Z,0-9]{2,},)

returns

  1. 04040
  2. p0015
  3. Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, VECTOR_AC, SERVO_I_AC,
    VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, TM15DI_DO, TM120,
  4. TM150,

i.e., it gets the first two groups right, but fails to correctly separate the last two groups.

What is wrong with the regex?
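The result above can be reproduced with a minimal Python sketch. Python's re module is an assumption here, since the question does not name a regex engine, and the sample line is assumed to carry no trailing whitespace.

```python
import re

# Sample line from the question (assumed to have no trailing whitespace).
text = (
    "04040p0015 Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, "
    "VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, "
    "TM15DI_DO, TM120, TM150,"
)

# The regex from the question, unchanged.
pattern = re.compile(r"([0-9]+)([rp][0-9]{4,})(.*)([A-Z]{2,}_?[A-Z,0-9]{2,},)")

match = pattern.search(text)
groups = match.groups()

# Print each capture group; group 3 swallows most of the tags,
# and group 4 holds only the final tag.
for number, value in enumerate(groups, start=1):
    print(number, repr(value))
```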


Solution

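  • The greedy (.*) in the third group consumes as much text as it can, and the fourth group matches only a single tag, so only the last tag ends up in group 4. Making the third group lazy and letting the fourth group repeat yields the desired four capture groups: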

    (\d+)([rp]\d{4,})(.*?)\s+((?:[A-Z]\w+,\s+)+)
    

    RegEx Details:

    • (\d+): 1st group, captures 1+ digits
    • ([rp]\d{4,}): 2nd group, captures r or p followed by 4+ digits
    • (.*?): 3rd group, captures 0 or more of any character, as few as possible (lazy)
    • \s+: 1+ whitespace characters, not captured
    • ((?:[A-Z]\w+,\s+)+): 4th group, captures 1+ tags, each an uppercase letter followed by 1+ word characters (letters, digits or _), a comma and 1+ whitespace characters
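As a rough illustration of the breakdown above, here is the same regex applied in Python (the choice of Python's re module is an assumption; the answer does not name an engine). Because the pattern expects whitespace after every comma, the sketch assumes the line ends with a newline, as it would when read from a file. The relaxed pattern is a hypothetical variant, not part of the answer, that also accepts end of input after the last comma.

```python
import re

# Sample line from the question, assumed here to end with a newline.
text = (
    "04040p0015 Macro drive object / Macro DO SERVO, VECTOR, HLA, SERVO_AC, "
    "VECTOR_AC, SERVO_I_AC, VECTOR_I_AC, A_INF, S_INF, R_INF, B_INF, TM31, "
    "TM15DI_DO, TM120, TM150,\n"
)

# The regex from the answer, unchanged.
pattern = re.compile(r"(\d+)([rp]\d{4,})(.*?)\s+((?:[A-Z]\w+,\s+)+)")

first_id, second_id, title, tags = pattern.search(text).groups()
title = title.strip()  # group 3 keeps the space that follows the second ID

# Split the tag group into individual tags, dropping empty pieces.
tag_list = [t.strip() for t in tags.split(",") if t.strip()]
print(first_id, second_id, repr(title), tag_list)

# Without trailing whitespace, the answer's pattern stops before the last tag.
stripped = text.rstrip()
short_tags = pattern.search(stripped).group(4)

# Hypothetical variant (not from the answer): allow end of input after a comma.
relaxed = re.compile(r"(\d+)([rp]\d{4,})(.*?)\s+((?:[A-Z]\w+,(?:\s+|$))+)")
relaxed_tags = relaxed.search(stripped).group(4)
print(repr(short_tags[-14:]), repr(relaxed_tags[-14:]))
```

If the input lines are stripped before matching, the relaxed variant, or re-appending a single space, keeps the final tag inside group 4.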