Search code examples
sqlregexposixdb2-400

Getting what's left after removing regex match


Context is SQL on the AS/400 (IBM i)

My goal is to end up with two values: a string determined by a regex I already have, and then everything else in the source string with the result of the regex removed and the gap (if any) closed up.

Here's the SQL:

select HAD1,                                                   
 regexp_substr(HAD1,'\b(GATE|LEVEL|DOOR|UNITS?)\s[\dA-Z]{1,}'),
 regexp_substr(HAD1,'**eventual_regex_goes_here**')                
from ECH                                                       
where regexp_like(HAD1,'\bGATE')                               

And the desired result:

Ship To                              REGEXP_SUBSTR     REGEXP_SUBSTR       
Address                                                                 
D2 COMPOUND, GATE 11                 GATE 11           D2 COMPOUND,  
2/22 GATEWAY DRIVE                   -                 2/22 GATEWAY DRIVE  
ASHBURTON FITTINGS  GATE 2           GATE 2            ASHBURTON FITTINGS
BRIERLY RD, GATE A, RIVER SIDE       GATE A            BRIERLY RD, , RIVER SIDE  
GATE 16, 37 KENEPURU DRIVE           GATE 16           , 37 KENEPURU DRIVE  

If the second expression could take the commas out too that would be great, but it's not required. The remaining string will be put through other (non-regex) processing to remove extraneous elements (phone numbers, comments, punctuation, etc.)

The closest of the posts suggested by the board software was this one, which gave the following string:

^.+?(?=\d{2})|(?<=\d{2}).+$

So, first I tried putting my whole expression in place of both occurrences of \d{2} and found that this (unsurprisingly) would not process. Then I went back to a more basic test and tried to build up from there.

Let's try just the word GATE as a constant, plus a couple of boundaries ( because deep down I'm still just a child, and you know what they say: "Children need boundaries").

select had1,                                                   
 regexp_substr(HAD1,'\b(GATE|LEVEL|DOOR|UNITS?)\s[\dA-Z]{1,}'),
 regexp_substr(HAD1,'^.+?(?=\bGATE\b)|(?<=\bGATE\b).+$')               
from ech                                                       
where regexp_like(HAD1,'\bGATE')      

Result:

Ship To                                   REGEXP_SUBSTR     REGEXP_SUBSTR                 
Address                                                                                   
GATE 3, CNR QUARRY ROAD                   GATE 3             3, CNR QUARRY ROAD           
ASHBURTON FITTINGS  GATE 2                GATE 2            ASHBURTON FITTINGS            
GATE 6, HELLABYS ROAD                     GATE 6             6, HELLABYS ROAD             
GATE 3, 548 PAKAKARIKI HILL               GATE 3             3, 548 PAKAKARIKI HILL       
GATE 5 - FLIGHTYS COMPOUND                GATE 5             5 - FLIGHTYS COMPOUND        
GATE 3 - 548 PAEKAKARIKI HILL ROAD        GATE 3             3 - 548 PAEKAKARIKI HILL ROAD
GATE 14 - TAKAPU COMPOUND                 GATE 14            14 - TAKAPU COMPOUND         
35 GATEWAY DRIVE                          -                 -                             
GATE 6                                    GATE 6             6                            
TAKAPU ROAD,GATE 20,SH1                   GATE 20            TAKAPU ROAD,

This is looking promising, keeping in mind that I'm not using the full expression for the second result column. But there is already one thing wrong.

In the second and last row there should be more data, '2' and ',SH1', respectively. And the string '35 GATEWAY DRIVE' should be in the last column. I want everything except what the expression finds (which is just the whole word GATE at the moment, remember).

It seems it can return the remaining text from one side or other of the removed text, but not both, and not all the remaining text if nothing is found to remove. So there's no point in my pressing on with adding in more sophistication to include the gate number until I understand why I'm not getting all the text that is not GATE. Therefore I'll pause here and ask for help.


Solution

  • you could try this:

    with data (s) as (values
    ('D2 COMPOUND, GATE 11'), 
    ('2/22 GATEWAY DRIVE'),
    ('ASHBURTON FITTINGS  GATE 2'),
    ('BRIERLY RD, GATE A, RIVER SIDE'),
    ('GATE 16, 37 KENEPURU DRIVE')
    ) 
    select s,
           regexp_substr(s,' ?(GATE|LEVEL|DOOR|UNITS) '),
           replace(regexp_replace(s,' ?(GATE|LEVEL|DOOR|UNITS) ',''),',',' ')
    from   data
    

    Result:

    D2 COMPOUND, GATE 11             GATE   D2 COMPOUND 11
    2/22 GATEWAY DRIVE                 -    2/22 GATEWAY DRIVE
    ASHBURTON FITTINGS  GATE 2       GATE   ASHBURTON FITTINGS 2
    BRIERLY RD, GATE A, RIVER SIDE   GATE   BRIERLY RD A  RIVER SIDE
    GATE 16, 37 KENEPURU DRIVE       GATE   16  37 KENEPURU DRIVE