Search code examples
sqlregexoracle-databaseregexp-replace

Replace the text within 0 or more html tags


I have a table with a column that contains varchar2 strings like the following.

My dog chases my cat
<p>My dog ate my other cat</p>
<p><div id="abcd">My cat ate my hamster</div><p>
<p><b><div id="abcd">My hamster sleeps all the time</div></b></p>

I need to do a SUBSTR on the text within the html tags.

I'm thinking that numbered groups is the way to go but I can't get the the closing tag into its own group. This is the sql I have:


WITH sentences AS 
(
SELECT 1 as nr, 'My dog chases my cat' AS ln FROM DUAL
UNION
SELECT 2, '<p>My dog ate my other cat</p>' FROM DUAL
UNION
SELECT 3,'<p><x><div id="abcd">My cat ate my hamster</div></x></p>' FROM DUAL
UNION
SELECT 4,'<p><b><div id="abcd">My hamster sleeps all the time</div></b></p>' FROM DUAL
)
SELECT nr, regexp_replace(ln, '^((<[^>]+>)+)(.*)((<[^>]+>)+)$', 'group 1:\1,group 2:\2,group 3:\3,group 4:\4', 1, 1, 'n')
FROM sentences order by nr;

RETURNS

1   My dog chases my cat
2   group 1:<p>,group 2:<p>,group 3:My dog ate my other cat,group 4:</p>
3   group 1:<p><x><div id="abcd">,group 2:<div id="abcd">,group 3:My cat ate my hamster</div></x>,group 4:</p>
4   group 1:<p><b><div id="abcd">,group 2:<div id="abcd">,group 3:My hamster sleeps all the time</div></b>,group 4:</p>

Group 4 only contains one closing tag, the other closing tag(s) is/are in group 3 as shown in row 3 and 4. What pattern do I need to get all closing tags in their own numbered group ?


Solution

  • Having said not to do this, you're actually only one character out, for these specific values anyway:

    WITH sentences AS 
    (
    SELECT 1 as nr, 'My dog chases my cat' AS ln FROM DUAL
    UNION
    SELECT 2, '<p>My dog ate my other cat</p>' FROM DUAL
    UNION
    SELECT 3,'<p><x><div id="abcd">My cat ate my hamster</div></x></p>' FROM DUAL
    UNION
    SELECT 4,'<p><b><div id="abcd">My hamster sleeps all the time</div></b></p>' FROM DUAL
    )
    SELECT nr,
      regexp_replace(ln, '^((<[^>]+>)+)(.*?)((<[^>]+>)+)$', 'group 1:\1,group 2:\2,group 3:\3,group 4:\4', 1, 1, 'n') as str
    --------------------------------------^
    FROM sentences order by nr;
    

    Without that ? making the .* non-greedy, you're including the earlier end tags in that third group, and only the final end-tag goes in group 4 because it has to.

    NR STR                                                                                                                     
    -- ------------------------------------------------------------------------------------------------------------------------
     1 My dog chases my cat                                                                                                    
     2 group 1:<p>,group 2:<p>,group 3:My dog ate my other cat,group 4:</p>                                                    
     3 group 1:<p><x><div id="abcd">,group 2:<div id="abcd">,group 3:My cat ate my hamster,group 4:</div></x></p>              
     4 group 1:<p><b><div id="abcd">,group 2:<div id="abcd">,group 3:My hamster sleeps all the time,group 4:</div></b></p>     
    

    Or just that group:

    SELECT nr, regexp_replace(ln, '^((<[^>]+>)+)(.*?)((<[^>]+>)+)$', '\3', 1, 1, 'n') as str
    FROM sentences order by nr;
    
    NR STR                           
    -- ------------------------------
     1 My dog chases my cat          
     2 My dog ate my other cat       
     3 My cat ate my hamster         
     4 My hamster sleeps all the time