Search code examples
oracle-databasepattern-matchingregexp-replacesearch-multiple-words

How to pattern match multiple words and replace it in Oracle?


There is a table with columns word and sentence. I am trying to replace the words in a sentence if the word exits in "words" column.

Tried the below code but it works only for single word. But I need to replace multiple terms if it exits in the word column.

Create table temp(id NUMBER,
word VARCHAR2(1000),
Sentence VARCHAR2(2000));

insert into temp(1,'automation testing','automation testing is popular kind of testing');
insert into temp(2,'testing','manual testing');
insert into temp(3,'manual testing','this is an old method of testing');

BEGIN
for t1 in (select id, word from temp)
LOOP
    for t2 in (select rownum from temp where sentence is not null)
    LOOP
        update temp 
        set sentence = REPLACE(sentence, t1.word,t1.id)
        where rownum = rownum;
    END LOOP;
END LOOP;
END;

But I need to replace multiple terms if it exits in the word column.

Expected outcome:

id word                   sentence
1  automation testing     1 is popular kind of 2
2  testing                3
3  manual testing         this is an old method of 2
Updated code:

MERGE INTO temp dst
USING (
  WITH ordered_words ( rn, id, word, regex_safe_word ) AS (
    SELECT ROW_NUMBER() OVER ( ORDER BY LENGTH( word ) ASC, word DESC ),
           id,
           word,
           REGEXP_REPLACE( word, '([][)(}{|^$\.*+?])', '\\\1' )
    FROM   temp
  ),
  sentences_with_ids ( rid, sentence, rn ) AS (
    SELECT ROWID,
           sentence,
           ( SELECT COUNT(*) + 1 FROM ordered_words )
    FROM   temp
  UNION ALL
    SELECT s.rid,
           REGEXP_REPLACE(
             REGEXP_REPLACE(
               s.sentence,
               '(^|\W)' || w.regex_safe_word || '($|\W)',
               '\1${'|| w.id ||'}\2'
              ),
             '(^|\W)' || w.regex_safe_word || '($|\W)',
             '\1${' || w.id || '}\2'
           ),
           s.rn - 1
    FROM   sentences_with_ids s
           INNER JOIN ordered_words w
           ON ( s.rn - 1 = w.rn ) 
  ),
  sentences_with_words ( rid, sentence, rn ) AS (
    SELECT rid,
           sentence,
           ( SELECT COUNT(*) + 1 FROM ordered_words )
    FROM   sentences_with_ids
    WHERE  rn = 1
  UNION ALL
    SELECT s.rid,
           REPLACE(
             s.sentence,
             '${' || w.id || '}',
             'http://localhost/' || w.id || '/<u>' || w.word || '</u>'
           ),
           s.rn - 1
    FROM   sentences_with_words s
           INNER JOIN ordered_words w
           ON ( s.rn - 1 = w.rn ) 
  )
  SELECT rid, sentence
  FROM   sentences_with_words
  WHERE  rn = 1
) src
ON ( dst.ROWID = src.RID )
WHEN MATCHED THEN
  UPDATE
  SET    sentence = src.sentence;

Can we improve the performance of the above updated query?


Solution

  • 2019-09-13 13:51 UTC - I see what you are getting at now. You want to replace not strings but words, and the new string will include the word(s). So you are doing a first series of replacements, then a second series.

    To make this go faster, I would still do a JOIN of the words to the sentences in which they are found. Here is the SELECT part of the solution (the bit that goes into the USING clause) so you can check what goes on and see some results quicker. Some words of explanation after the code.

    with words(id, word, word_length, search1, replace1, search2, replace2) as (
      select id, word, length(word),
      '(^|\W)' || REGEXP_REPLACE(word, '([][)(}{|^$\.*+?])', '\\\1') || '($|\W)',
      '\1{'|| id ||'}\2',
      '{'|| id ||'}',
      'http://localhost/' || id || '/<u>' || word || '</u>'
      FROM temp
    )
    , joined_data as (
      select w.search1, w.replace1, w.search2, w.replace2,
        s.rowid s_rid, s.sentence,
        row_number() over(partition by s.rowid order by word_length desc) rn
      from words w
      join temp s
      on instr(s.sentence, w.word) > 0
      and regexp_like(s.sentence, w.search1)
    )
    , unpivoted_data as (
      select S_RID, SENTENCE, PHASE, SEARCH_STRING, REPLACE_STRING,
        row_number() over(partition by s_rid order by phase, rn) rn,
        case when row_number() over(partition by s_rid order by phase, rn)
          = count(*) over(partition by s_rid)
          then 1
          else 0
        end is_last
      from joined_data
      unpivot(
        (search_string, replace_string) 
        for phase in ( (search1, replace1) as 1, (search2, replace2) as 2 ))
    )
    , replaced_data(S_RID, RN, is_last, SENTENCE) as (
      select S_RID, RN, is_last,
        regexp_replace(SENTENCE, search_string, replace_string)
      from unpivoted_data
      where rn = 1
      union all
      select n.S_RID, n.RN, n.is_last,
        case when n.phase = 1
          then regexp_replace(o.SENTENCE, n.search_string, n.replace_string)
          else replace(o.SENTENCE, n.search_string, n.replace_string)
        end
      from unpivoted_data n
      join replaced_data o
        on o.s_rid = n.s_rid and n.rn = o.rn + 1  
    )
    select s_rid, sentence from replaced_data
    where is_last = 1
    order by s_rid;
    

    The WORDS subquery gets the word length and the "search" and "replace" strings for both series of replacements.

    JOINED_DATA joins the words to the sentences. I do INSTR first and do REGEXP only when needed because it costs more CPU.

    UNPIVOTED_DATA splits the rows into two phases: the first replaces "automation testing" by "{1}" and the second replaces "{1}" by "http://localhost/1/automation testing". The rows are assigned the right sequence, and the "last" row for each sentence is identified.

    REPLACED_DATA does either REGEXP_REPLACE or REPLACE depending on the phase. In the second phase, REPLACE is enough.

    Regards, Stew