Search code examples
sqlregexoracle-databaseregex-group

Oracle SQL regexp_substr non-capturing/optional group


The expression:

Reassigning definition: (\d+) from: \[(\d+)\] to: \[(\d+)\].+?\.(?: Target definition = (\d+))?.*

correctly produces the following matches:

Group 1.    24-30   494801
Group 2.    38-45   8280955
Group 3.    52-59   8336297
Group 4.    103-109 494767

for the input string:

Reassigning definition: 494801 from: [8280955] to: [8336297], advancing due dates. Target definition = 494767.

and the first 3 matches for the input string:

Reassigning definition: 494801 from: [8280955] to: [8336297], advancing due dates.

with JavaScript, Python, PHP, and GoLang flavors (see https://regex101.com/r/Br66wm/3), but does not with SQL regexp-substr:

with
  input_string as
  (
    select 'Reassigning definition: 494801 from: [8280955] to: [8336297], advancing dates. Target definition = 494767.' as test_string from dual
    union all
    select 'Reassigning definition: 494801 from: [8280955] to: [8336297], advancing dates.' as test_string from dual
   ),
   pattern_string as
   (
     select 'Reassigning definition: (\d+) from: \[(\d+)\] to: \[(\d+)\].+?\.(?: Target definition = (\d+))?.*$' as pattern_string from dual
   )
select
  regexp_substr(i.test_string, p.pattern_string, 1, 1, null, 1) as group_1,
  regexp_substr(i.test_string, p.pattern_string, 1, 1, null, 2) as group_2,
  regexp_substr(i.test_string, p.pattern_string, 1, 1, null, 3) as group_3,
  regexp_substr(i.test_string, p.pattern_string, 1, 1, null, 4) as group_4
from
  input_string i, pattern_string p;

where the 4th group is always null. What's wrong with my usage of non-capturing groups? Basically, the following sentence is optional in my input test strings:

 Target definition = 494767.

Solution

  • This is a little bit too much to put into a comment so I'll write it here. If it doesn't make sense, I'll just remove it.

    If you're always looking for digits in those strings (not related to what surrounds them), then it could be simplified to

    SQL> with
      2    input_string as
      3    (
      4      select 'Reassigning definition: 494801 from: [8280955] to: [8336297], advancing dates. Target definition = 494767.' as test_string from dual
      5      union all
      6      select 'Reassigning definition: 494801 from: [8280955] to: [8336297], advancing dates.' as test_string from dual
      7     )
      8  select regexp_substr(test_string, '\d+', 1, 1) grp1,
      9         regexp_substr(test_string, '\d+', 1, 2) grp2,
     10         regexp_substr(test_string, '\d+', 1, 3) grp3,
     11         regexp_substr(test_string, '\d+', 1, 4) grp4
     12  from input_string;
    
    GRP1       GRP2       GRP3       GRP4
    ---------- ---------- ---------- ----------
    494801     8280955    8336297    494767
    494801     8280955    8336297
    
    SQL>
    

    Or, option which doesn't have fixed number of groups (though, layout is different from what you wanted):

    SQL> with
      2    input_string as
      3    (
      4      select 'Reassigning definition: 494801 from: [8280955] to: [8336297], advancing dates. Target definition = 494767.' as test_string from dual
      5      union all
      6      select 'Reassigning definition: 494801 from: [8280955] to: [8336297], advancing dates.' as test_string from dual
      7     )
      8  select column_value grp_rn,
      9         regexp_substr(test_string, '\d+', 1, column_value) grp
     10  from input_String cross join
     11    table(cast(multiset(select level from dual
     12                        connect by level <= regexp_count(test_string, '\d+')
     13                       ) as sys.odcinumberlist));
    
     GRP_RN GRP
    ------- ----------
          1 494801
          2 8280955
          3 8336297
          4 494767
          1 494801
          2 8280955
          3 8336297
    
    7 rows selected.