swift regex icu

Regex: Capture Groups and Empty Fields (Swift 5 | ICU Regex Engine)


I need some help correcting my regex. I have a string containing a large body of HTML, and I need to pattern-match it so that the data nested within its <div> tags can be extracted and used.

Let's take <div id=1> as a test case:

<div id=1>UID:1currentPartNumber:63222TRES003H1workcenter:VLCSKDcycleTime:98.8curPartCycleTime:63.66partsMade:233curCycleTimeActual:62.4target:291actual:233downtime:97statusReason:lineStatus:Productionefficiency:80.05plusminus:-260curProdTime:7/16/2019 12:28:01 PM</div>

Note that lineStatus can either have a value or be empty, and the same is true of statusReason.

I have come up with a regex that does most of the work, but I am struggling with cases where values are not present.

Here is my attempt:

(
(<div id=(\d|\d\d)>)
(UID:(\d|\d\d))
(currentPartNumber:(.{1,20}))
(workcenter:(.{1,20}))
(cycleTime:(.{1,6}))
(curPartCycleTime:(.{1,6}))
(partsMade:(.{1,6}))
(CycleTimeActual:(.{1,6}))
(target:(.{1,6}))
(actual:(.{1,6}))
(downtime:(.{1,6}))
((statusReason:((?:.)|(.{1,6}))))
((lineStatus:((?:.)|(.{1,6}))))
(Productionefficiency:(.{1,6}))
(plusminus:(.{1,6}))
(curProdTime:(.{1,30}))
)

I have split it across lines here just for readability.

Thanks,


Solution

  • You are very, very close.

    If you use:

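    (
    (<div id=\d{1,2}>)
    (UID:\d{1,2})
    (currentPartNumber:(.{1,20}))
    (workcenter:(.{1,20}))
    (cycleTime:(.{1,6}))
    (curPartCycleTime:(.{1,6}))
    (partsMade:(.{1,6}))
    (curCycleTimeActual:(.{1,6}))
    (target:(.{1,6}))
    (actual:(.{1,6}))
    (downtime:(.{1,6}))
    (statusReason:(.{0,6}))
    (lineStatus:(.{0,20}))
    (efficiency:(.{1,6}))
    (plusminus:(.{1,6}))
    (curProdTime:(.{1,30}))
    (<\/div>)
    )
    

    Then $3\n$4\n$6\n$8\n$10\n$12\n$14\n$16\n$18\n$20\n$22\n$24\n$26\n$28\n$30 will be:

    UID:1
    currentPartNumber:63222TRES003H1
    workcenter:VLCSKD
    cycleTime:98.8
    curPartCycleTime:63.66
    partsMade:233
    curCycleTimeActual:62.4
    target:291
    actual:233
    downtime:97
    statusReason:
    lineStatus:Production
    efficiency:80.05
    plusminus:-260
    curProdTime:7/16/2019 12:28:01 PM
    

    By using (statusReason:(.{0,6}))(lineStatus:(.{0,20})) you make the values of statusReason and lineStatus truly optional: a lower bound of 0 lets the group match an empty string, whereas .{1,6} always demands at least one character.

    Two literal keys also had to change to fit your sample data, as I read it. The key appears to be curCycleTimeActual; with CycleTimeActual alone, partsMade captures "233cur". And Production looks like the value of lineStatus rather than part of the efficiency key, so the following key becomes efficiency and the lineStatus range is widened to hold the 10-character value.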

    I also simplified the start <div> and UID detection.
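
    For the Swift side, here is a minimal sketch of how a pattern like this could be applied with NSRegularExpression, which is backed by the ICU engine. It is not a drop-in implementation. The names html, keys, and parseFields are made up for illustration, and the key list assumes the field names read off your sample <div> (curCycleTimeActual and efficiency in particular). It also swaps the bounded {m,n} ranges for lazy (.*?) captures, which is a substitution on my part: each value simply stops at the next key, so empty values need no special handling.

```swift
import Foundation

// Sample input copied from the question.
let html = "<div id=1>UID:1currentPartNumber:63222TRES003H1workcenter:VLCSKDcycleTime:98.8curPartCycleTime:63.66partsMade:233curCycleTimeActual:62.4target:291actual:233downtime:97statusReason:lineStatus:Productionefficiency:80.05plusminus:-260curProdTime:7/16/2019 12:28:01 PM</div>"

// Field names in the order they appear in the <div>.
// Assumption: these are the real keys, as read off the sample data.
let keys = [
    "UID", "currentPartNumber", "workcenter", "cycleTime",
    "curPartCycleTime", "partsMade", "curCycleTimeActual", "target",
    "actual", "downtime", "statusReason", "lineStatus",
    "efficiency", "plusminus", "curProdTime",
]

// Build "key:(.*?)" for every key. The lazy quantifier lets a value be
// empty and makes it stop at the first occurrence of the next key.
let pattern = "<div id=\\d{1,2}>"
    + keys.map { $0 + ":(.*?)" }.joined()
    + "</div>"

// Hypothetical helper: returns the captured value for each key,
// or an empty dictionary if the pattern does not match.
func parseFields(in text: String) -> [String: String] {
    guard let regex = try? NSRegularExpression(pattern: pattern),
          let match = regex.firstMatch(
              in: text,
              range: NSRange(text.startIndex..., in: text))
    else { return [:] }

    var fields: [String: String] = [:]
    for (index, key) in keys.enumerated() {
        // Group 0 is the whole match, so the values start at group 1.
        if let range = Range(match.range(at: index + 1), in: text) {
            fields[key] = String(text[range])
        }
    }
    return fields
}

let fields = parseFields(in: html)
for key in keys {
    print("\(key):\(fields[key] ?? "")")
}
```

    Building the pattern from an array keeps the one-field-per-line readability of your split-up version while still handing the engine a single string. The trade-off of the lazy captures is that they no longer enforce a maximum length per value; if you need those limits, the bounded groups above can be put back in place of (.*?).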