Remove <> symbols from CDATA XML tag using regex


Let's say I have an XML document like this:

<records>
    <record>
        <name>Jon</name>
        <surname>Doe</surname>
        <dob>2001-02-01</dob>
        <comment>
            <![CDATA[[ Patient with > 2 and < 5 siblings]]]>
        </comment>
    </record>
    <record>
        <name>Jane</name>
        <surname>Doe</surname>
        <dob>2001-02-01</dob>
        <comment>
            <![CDATA[[ Patient with > 2 siblings ]]]>
        </comment>
    </record>
</records>

I need to convert this document to a JSON object using xml2js, but I need to remove the < and > symbols from the CDATA content first so that they do not break the JSON conversion process.

What I have tried

Since I understand that I need to remove these symbols before passing the XML string to the xml2js parser I have tried variations of the solutions described in the following cases:

I am successful in matching the entire contents of the CDATA tag, but I am not able to match the specific characters that I need to remove. This has to be accomplished in a single regex so I can pass the result to the XML-to-JSON parser.

Any help or pointers would be greatly appreciated. Thanks in advance.

Additional Info

Adding this since the question was voted down due to lack of research evidence.

I tried modifying a regex rule I found in one of the references I mentioned. This is the rule.

\[CDATA\[(.*?)\]\]>

This matches the entire contents of the CDATA tag. This is helpful, but what I need is to replace/remove content within the CDATA tags. Here is how it looks in the regex editor.

[screenshot of the regex editor]

I then proceeded to modify the rule to match either < or >. Here is the rule that I tried.

\[CDATA\[(.*?)[<>]*\]\]>

This rule matches the following content (not just the <> signs).

    [ Patient with > 2 and < 5 siblings]

Here is how it looks in the regex editor.

[screenshot of the regex editor]
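
The same behaviour can be reproduced outside the regex editor. The following is a minimal sketch in JavaScript (the variable names are only illustrative). It suggests why the second rule captures the same text as the first: the lazy group (.*?) keeps expanding until the rest of the pattern can match, and because [<>]* is allowed to match zero characters, it never has to consume a < or >.

```javascript
// One CDATA section taken from the sample document.
const sample = '<![CDATA[[ Patient with > 2 and < 5 siblings]]]>';

// The two rules from the question.
const first = /\[CDATA\[(.*?)\]\]>/;
const second = /\[CDATA\[(.*?)[<>]*\]\]>/;

// Group 1 holds whatever the lazy (.*?) ended up capturing.
const firstBody = sample.match(first)[1];
const secondBody = sample.match(second)[1];

console.log(firstBody);
console.log(secondBody);
// Both rules capture the same text, so the added [<>]* changes nothing here.
console.log(firstBody === secondBody);
```

In other words, a single pattern of this shape can locate the CDATA body, but it does not isolate the individual < and > characters inside it.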

I hope this gives more clarity about what I am trying to accomplish.

Edit 2:

Here is the error triggered by the code. The relevant error message is "invalid closing tag".

[screenshot of the error trace]

Here is line 38 of import.js as referenced in the error trace.

const jsonXml = await parseStringPromise(xml).then((res) => res);

This line uses xml2js to parse the XML document and convert it to a JSON object. Because the CDATA section contains the < and > symbols, I assume that the parser thinks they are part of an XML tag that is not closed properly.


Solution

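  • Since you are working in JavaScript, you can use

    const text = `<comment>
       <![CDATA[[ Patient with > 2 and < 5 siblings]]]>
    </comment>`
    // Match each CDATA block, from [CDATA[[ up to and including the closing ]]>
    const re = /\[CDATA\[\[[^]*?]]>/g
    // Strip < and > from the block body only; slice(0, -3) sets the closing ]]> aside so its > survives
    console.log( text.replace(re, (x) => x.slice(0, -3).replace(/[<>]/g, '') + ']]>') )

    The \[CDATA\[\[[^]*?]]> pattern (see its demo) matches all CDATA blocks, even if they span multiple lines, because

    • \[CDATA\[\[ matches [CDATA[[ substrings
    • [^]*? matches zero or more chars (line breaks included), as few as possible
    • ]]> matches ]]>.

    Then, once a match is found, every < and > is removed from the matched text with .replace(/[<>]/g, ''), except for the closing ]]>, which is cut off first and appended back unchanged so that the CDATA section stays properly closed.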
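
    For a whole document with several records, the same idea could be wrapped in a small helper. This is only a sketch under a few assumptions: the function name stripAngleBracketsInCdata is hypothetical (it is not part of xml2js), the pattern is widened to match any <![CDATA[ ... ]]> section rather than only ones starting with [CDATA[[, and capture groups are used so that the opening marker and the closing ]]> are left untouched.

```javascript
// Hypothetical helper (not part of xml2js): removes < and > from the body
// of every CDATA section while keeping the CDATA delimiters intact.
function stripAngleBracketsInCdata(xml) {
  // Group 1: opening marker, group 2: body (lazy, may span lines), group 3: closing marker.
  const cdata = /(<!\[CDATA\[)([^]*?)(]]>)/g;
  return xml.replace(cdata, (match, open, body, close) =>
    open + body.replace(/[<>]/g, '') + close
  );
}

// Shortened version of the sample document from the question.
const xml = `<records>
  <record>
    <comment><![CDATA[[ Patient with > 2 and < 5 siblings]]]></comment>
  </record>
  <record>
    <comment><![CDATA[[ Patient with > 2 siblings ]]]></comment>
  </record>
</records>`;

const cleaned = stripAngleBracketsInCdata(xml);
console.log(cleaned);
```

    The cleaned string could then be handed to parseStringPromise in place of the original xml string. Markup outside the CDATA sections is not changed, since only the captured body goes through the inner replace.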