Search code examples
sqloracle-databaseregexp-replace

REGEXP_REPLACE to remove list of words from company name column


select 'Owens 1245 Coning COM CO LTD INC CORP INC', REGEXP_REPLACE(
      UPPER(TRIM(REGEXP_REPLACE(TRIM(REGEXP_REPLACE(regexp_replace('Owens 1245 Coning COM CO LTD', '(*[[:punct:]])', ' '),'(\s[0-9]+\s)|(^[0-9]+\s)|(\s[0-9]+$)',' ')),'( ){2,}', ' '))), 
      '\\b'||'CO|LTD|INC|CORP'||'\\W'
      ,'') CLEARTEXT
from dual;  

if i run the above query it removes all the stop words except "CO" and gives the output : OWENS CONING COM CO

however i need the output to be like : OWENS CONING COM


Solution

  • Try this but be sure to test with every combination of "stop words" and "good" data that could exist in the column. This regex looks for your "stop words" where followed by a space or the end of the line (the space is consumed) and replaces with nothing. I added surrounding brackets to show there are no spaces after the final TRIM( )):

    SQL> with tbl(str) as (
         select 'Owens 1245 Coning COM CO LTD. INC CORP. INC' from dual
       )
       select str original, '[' || trim(regexp_replace(upper(str), '(\d+|CO|LTD|INC|CORP)([[:punct:]])?( |$)', '')) || ']' CLEARTEXT
       from tbl;
    
    ORIGINAL
    -----------------------------------------
    CLEARTEXT
    --------------------------------------------------------------------------------
    Owens 1245 Coning COM CO LTD. INC CORP. INC
    [OWENS CONING COM]
    
    
    SQL>
    

    EDIT: Modified the regex to to allow for optional punctuation after the stop words.