Tags: javascript, xml, pentaho, pentaho-data-integration

Cleaning bad XML characters in a String in Pentaho


The issue: Receiving bad XML through the web / apps / file exchanges.

I was receiving XML responses over HTTP GET that sometimes contained invalid XML characters in the text.

The SUB control character (0x1A) was showing up in the text, and the 'Get data from XML' step would fail to read it, reporting: An invalid XML character (Unicode: 0x1a) was found in the element content of the document.


Solution

  • The solution was found in this Post.

    I did not need the entire JavaScript function from that answer, just the part that matches characters not allowed in XML.

    What I did was a simple replace() call in the 'Modified Java Script Value' step:

    var str = result.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm,'');

    This cleaned the entire XML of bad characters and made all of it readable.
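
One caveat with the pattern above: it also treats both halves of a surrogate pair as invalid, so characters above U+FFFF (emoji, for example) would be stripped along with the control characters. If that matters for your data, here is a minimal sketch of a variant that follows the XML 1.0 character ranges and keeps valid surrogate pairs. The function name stripInvalidXmlChars and the sample input are hypothetical, not part of Pentaho or of the original answer, and I have only reasoned about it as plain JavaScript, not tested it inside the 'Modified Java Script Value' step.

```javascript
// Hypothetical helper (not part of Pentaho): keeps only characters that the
// XML 1.0 specification allows and drops everything else.
function stripInvalidXmlChars(text) {
  // First alternative: a valid surrogate pair (a character above U+FFFF), kept as-is.
  // Second alternative: any single code unit outside the XML 1.0 ranges
  // #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD, which is removed.
  var pattern = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]/g;
  return text.replace(pattern, function (match) {
    return match.length === 2 ? match : '';
  });
}

// Example input containing the SUB control character (0x1A).
var dirty = '<name>Jo\x1Ahn</name>';
var clean = stripInvalidXmlChars(dirty);
```

In the step itself this would presumably be called as var str = stripInvalidXmlChars(result);, assuming result is the incoming field that holds the XML, as in the one-liner above.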