Search code examples
htmloracle-databaseplsql

How to extract HTML structure from string?


I have a long text written in HTML:

<body>
    <h2>title 1</h2>
    <h2>This is an <b>important</b> title</h2>
        Some text
        <h3>This a subtitle b</h3>
        <h3>This is also <span style="font-weight:500">important</span></h3>        
</body>

I need to extract the titles from it to create a Table of Content. I expect the result as:

h2      Title 1
h2      This is an <b>important</b> title
h3      This a subtitle b
h3      This is also <span style="font-weight:500">important</span>

or:

h2      Title 1
h2      This is an important title
h3      This a subtitle b
h3      This is also important

I tried:

select * from xmltable('body/*'  passing xmltype('<body><h2>title 1</h2><h2>This is an <b>important</b> title</h2>Some text<h3>This a subtitle b</h3><h3>This is also <span style="font-weight:500">important</span></h3></body>') 
columns 
tag_name varchar2(1000) path 'name()',
tag_value varchar2(1000) path 'text()')
where tag_name in ('h1','h2','h3','h4','h5')

But I'm getting the error:

ORA-19279: XPTY0004 - XQuery dynamic type mismatch: expected singleton sequence - got multi-item sequence
19279. 00000 -  "XPTY0004 - XQuery dynamic type mismatch: expected singleton sequence - got multi-item sequence" 
*Cause:    The XQuery sequence passed in had more than one item.
*Action:   Correct the XQuery expression to return a single item sequence.  

How can I solve that?


Solution

  • Would this work as the output?

    TAG_NAME TAG_VALUE                                                             
    -------- ----------------------------------------------------------------------
    h2       <h2>title 1</h2>                                                      
    h2       <h2>This is an <b>important</b> title</h2>                            
    h3       <h3>This a subtitle b</h3>                                            
    h3       <h3>This is also <span style="font-weight:500">important</span></h3>  
    

    This makes more sense to me - then let whatever xml tools you work with interpret the tag values as needed. (They may require that the tag values be xmltype data type - if so, then simply remove the xmlserialize wrapper in the select clause.)

    If that's acceptable, you can get it with a small modification of your query.

    select tag_name, xmlserialize(document tag_value) as tag_value
    from xmltable('body/*'  passing xmltype('<body><h2>title 1</h2>
      <h2>This is an <b>important</b> title</h2>Some text<h3>This a subtitle b</h3>
      <h3>This is also <span style="font-weight:500">important</span></h3></body>') 
    columns 
      tag_name varchar2(1000) path 'name()',
      tag_value xmltype path '.')
    where tag_name in ('h1','h2','h3','h4','h5')
    ;