
SED - HTML-encode between specific tags in XML file


I'm currently working on a short SED script that needs to HTML-encode parts of an XML file. The script currently looks like this:

sed.exe "/<messageData>/,/<\/messageData>/ {/<messageData>/b;/<\/messageData>/b; s/</\&lt;/g; s/>/\&gt;/g; }" %1 >%2

So basically, it replaces all < and > with &lt; and &gt; between the <messageData> and </messageData> tags.

This script works perfectly well with pretty-printed XML; that is,

<?xml version="1.0" encoding="ISO-8859-1"?>
<Messages>
    <messageData>
        <test>DATA</test>
    </messageData>
</Messages>

comes out as

<?xml version="1.0" encoding="ISO-8859-1"?>
<Messages>
    <messageData>
        &lt;test&gt;DATA&lt;/test&gt;
    </messageData>
</Messages>

which is what I need. My issue is that the files I need to process are not pretty-printed; everything is on a single line, like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<Messages><messageData><test>DATA</test></messageData></Messages>

With this format, the script no longer works. Would it be possible to modify my script to work with both formats?

Please note that I'm not able to affect the output format, and that SED is the scripting engine to be used.

I guess I could just create another SED script that inserts a line break after each > in the file, and then run the script I have now. However, I'm guessing that would not be very efficient performance-wise (two passes over each file).

Any suggestions?

Regards Daniel


Solution

  • In case someone happens to stumble on the same issue, this is how we solved it. I know it's not pretty, but it will have to do until we can use a better solution.

    sed.exe -i "s/\(>\)\(<\)/\1\n\2/g" %1
    sed.exe "/<messageData>/,/<\/messageData>/ {/<messageData>/b;/<\/messageData>/b; s/</\&lt;/g; s/>/\&gt;/g; }" %1 >%2