.net oracle-database oracle11g oracle10g oracle-sqldeveloper

Oracle Regex Substr to ignore optional group

I am working on regex_substr to get the following result in oracle

Test 1.Input: JOHN 10BC STUDENT Desired Output: JOHN

Test 2.Input: JOHN  STUDENT     Desired Output: JOHN

Test 3.Input: JOHN 10BC STU     Desired Output: JOHN

Test 4.Input: JOHN 10BC TEACHER Desired Output:NULL

Test 5.Input: JOHN  TEACHER     Desired Output:NULL

Test 6. Input: MR JOHN 08BC STU Desired Output: MR JOHN

Test 7. Input: MR JOHN STUDENT Desired Output: MR JOHN

Test 8. Input: MR JOHN 07BC TEACHER Desired Output: Null

Test 9. Input: MR STUART 06BC STDUENT Desired Output: MR STUART 

Test 10. Input: MR STUART LEE 05BC STDUENT Desired Output: MR STUART LEE

Query:

Test 1:

select REGEXP_SUBSTR('JOHN 10BC STUDENT','(.*)(\s+.*BC)?\sSTU(DENT)?',1,1,'i',1) from dual;

Output: Failed . Returned JOHN 10BC instead of JOHN

Test 2:

select REGEXP_SUBSTR('JOHN STUDENT','(.*)(\s+.*BC)?\sSTU(DENT)?',1,1,'i',1) from dual;

Output: Passed. Returned JOHN

I modified the query as below by removing the ? in the second block

Test 1:select REGEXP_SUBSTR('JOHN 10BC STUDENT','(.*)(\s+.*BC)\sSTU(DENT)?',1,1,'i',1) from dual; Output: Passed. Returned JOHN

Test 2:select REGEXP_SUBSTR('JOHN STUDENT','(.*)(\s+.*BC)\sSTU(DENT)?',1,1,'i',1) from dual; Output: Failed. Returned Null instead of JOHN.

How to ignore the middle optional word BC and always return the words till BC as a substring for STUDENT using REGEXP_SUBSTR. Thanks in Advance!!!

Solution

Use non-greedy pattern for any-character match (.*) by adding a question mark after it (.*?) to prevent it from grabbing an optional part ((\s+.*BC)?):

with test (id, col) as (
  select 1, 'JOHN 10BC STUDENT'  from dual union all
  select 2, 'JOHN  STUDENT'      from dual union all
  select 3, 'JOHN 10BC STU'      from dual union all
  select 4, 'JOHN 10BC TEACHER'  from dual union all
  select 4, 'JSTUOHN 10BC TEACHER'  from dual union all
  select 5, 'JOHN  TEACHER'      from dual
)
select
  id,
  col,
  regexp_substr(col, '(.*?)(\s+.*BC)?\sSTU(DENT)?',1,1,'i',1) as qwe
from test

ID | COL                  | QWE  
-: | :------------------- | :----
 1 | JOHN 10BC STUDENT    | JOHN 
 2 | JOHN  STUDENT        | JOHN 
 3 | JOHN 10BC STU        | JOHN 
 4 | JOHN 10BC TEACHER    | null 
 4 | JSTUOHN 10BC TEACHER | null 
 5 | JOHN  TEACHER        | null

db<>fiddle here

UPD. To allow multiple words as the first part, you may use the following regex:

(.+?)(\s+\S*BC)?\s+STU(DENT)?$

Explanation:

(.+?) - match any character one or more times with the shortest match (allowing following optional groups to be matched).
(\s+\S*BC)? - optional match for the group of zero or more non-whitespace (\S) characters followed by BC. The group should be separated from the previous non-whitespace character by one or more whitespace characters. \S* instead of .* will prevent from inclusion of multiple words into this group.
\s+STU(DENT)?$ - the last part should be a word STUDENT or it's abbreviation STU. It should be the last word in the line ($ right after it).

And the result of extended test case is:

with test (id, col) as (
  select 1, 'JOHN 10BC STUDENT'  from dual union all
  select 2, 'JOHN  STUDENT'      from dual union all
  select 3, 'JOHN 10BC STU'      from dual union all
  select 4, 'JOHN 10BC TEACHER'  from dual union all
  select 4, 'JSTUOHN 10BC TEACHER'  from dual union all
  select 5, 'JOHN  TEACHER'      from dual union all
  select 6, 'MR JOHN 10BC STU' from dual union all
  select 7, 'MR STUART ABC 10BC STUDENT' from dual union all
  select 8, 'MR STUART ABC 10BC STUDENTS' from dual union all
  select 9, 'MR STUART STU 10BC TEACHER' from dual
)
select
  id,
  col,
  regexp_substr(col, '(.*?)(\s+\S*BC)?\s+STU(DENT)?$',1,1,'i',1) as qwe
from test

ID | COL                         | QWE          
-: | :-------------------------- | :------------
 1 | JOHN 10BC STUDENT           | JOHN         
 2 | JOHN  STUDENT               | JOHN         
 3 | JOHN 10BC STU               | JOHN         
 4 | JOHN 10BC TEACHER           | null         
 4 | JSTUOHN 10BC TEACHER        | null         
 5 | JOHN  TEACHER               | null         
 6 | MR JOHN 10BC STU            | MR JOHN      
 7 | MR STUART ABC 10BC STUDENT  | MR STUART ABC
 8 | MR STUART ABC 10BC STUDENTS | null         
 9 | MR STUART STU 10BC TEACHER  | null

db<>fiddle here