Search code examples
sqlxmloraclexmltypexmltable

How to efficiently replace special characters in an XML in Oracle SQL?


I'm parsing an xml in oracle sql.

XMLType(replace(column1,'&','<![CDATA[&]]>')) //column1 is a column name that has xml data

While parsing, I'm temporarily wrapping '&' in CDATA to prevent any xml exception. After getting rid of the exception caused by '&', I'm getting "invalid character 32 (' ') found in a Name or Nmtoken". This is because of '<' character.

E.g: <child> 40 < 50 </child> // This causes the above exception.

So I tried the below and it works.

XMLType(replace(replace(column1,'&','<![CDATA[&]]>'),'< ','<![CDATA[< ]]>'))

In the above, I'm wrapping '< '(less than symbol followed by space) in CDATA. But the above is a bit time consuming. So I'm trying to use regex to reduce the time taken. Does anyone know how to implement the above action using regex in Oracle sql??

 Input : <child> 40 & < 50 </child>
 Expected Output : <child> 40 <![CDATA[&]]> <![CDATA[< ]]> 50 </child>

Note: Replacing '& ' with ampersand semicolon sometimes is leading to 'entity reference not well formed' exception. Hence I have opted to wrap in CDATA.


Solution

  • You can do that with a regexp like this:

    select regexp_replace(sr.column1,'(&|< )','<![CDATA[\1]]>') from dual;
    

    However, regexp_replace (and all the regexp_* functions) are often slower than using plain replace, because they do more complicated logic. So I'm not sure if it'll be faster or not.

    You might already be aware, but your underlying problem here is that you're starting out with invalid XML that you're trying to fix, which is a hard problem! The ideal solution is to not have invalid XML in the first place - if possible, you should escape special characters when originally generating your XML. There are built-in functions which can do that quickly, like DBMS_XMLGEN.CONVERT or HTF.ESCAPE_SC.