Search code examples
sqloracle-databaseregexp-replace

Consecutive Pattern replacing is not happening with REGEXP_REPLACE


I have a string as below

Welcome to the world of the Hackers

I am trying to replace the occurrences of listed strings i.e. of,to,the in between the entire string using below query, but it's not working properly if the patterns are consecutive, it fails.

SELECT regexp_replace( 'Welcome to the world of the Hackers', '( to )|( the )|( of )', ' ' ) 
FROM dual;

Output: Welcome the world the Hackers

Even if the pattern is repeating consecutive it is not working i.e.

SELECT regexp_replace( 'Welcome to to the world of the Hackers', '( to )|( the )|( of )', ' ' ) 
FROM dual;

Output: Welcome to world the Hackers

Whereas my expected output is: Welcome world Hackers

Is there any alternative/solution for this using REGEXP_REPLACE?


Solution

  • You can use the regular expression (^|\s+)((to|the|of)(\s+|$))+:

    SQL Fiddle

    Query 1:

    WITH test_data ( sentence ) AS (
      SELECT 'to the of' FROM DUAL UNION ALL
      SELECT 'woof breathe toto' FROM DUAL UNION ALL -- has all the words as sub-strings of words
      SELECT 'theory of the offer to total' FROM DUAL -- mix of words to replace and words starting with those words
    )
    SELECT sentence,
           regexp_replace(
             sentence,
             '(^|\s+)((to|the|of)(\s+|$))+',
             '\1'
           ) AS replaced
    FROM   test_data
    

    Results:

    |                     SENTENCE |           REPLACED |
    |------------------------------|--------------------|
    |                    to the of |             (null) | -- All words replaced
    |            woof breathe toto |  woof breathe toto |
    | theory of the offer to total | theory offer total |
    

    Why doesn't regexp_replace( 'Welcome to the world of the Hackers', '( to )|( the )|( of )', ' ' ) work with successive matches?

    Because the regular expression parser will look for the second match after the end of the first match and will not include the already parsed part of the string or the replacement text when looking for subsequent matches.

    So the first match will be:

     'Welcome to the world of the Hackers'
             ^^^^
    

    The second match will look in the sub-string following that match

     'the world of the Hackers'
               ^^^^
    

    The 'the ' at the start of the sub-string will not be matched as it has no leading space character (yes, there was a space before it but that was matched in the previous match and, yes, that match was replaced with a space but overlapping matches and matches on previous replacements are not how regular expressions work).

    So the second match is the ' of ' in the middle of the remaining sub-string.

    There will be no third match as the remaining un-parsed sub-string is:

    'the Hackers'
    

    and, again, the 'the ' is not matched as there is not leading space character to match.