
Why does XMLStarlet replace '>' with '&gt;' in a string?


Editing with XMLStarlet using:

xmlstarlet ed -O -u "/include/X-PRE-PROCESS[@cmd='set' and starts-with(@data,'domain=')]/@data" -v 'domain=test.domain' vars.xml

on a target file:

<include>
    <X-PRE-PROCESS cmd="set" data="domain=domain.com"/>
    <X-PRE-PROCESS cmd="set" data="bong-ring=v=-7;%(100,0,941.0,1477.0);v=-7;>=2;+=.1;%(1400,0,350,440)"/>
</include>

changes the data="domain=domain.com" value as intended,
but it also (unexpectedly, for me) changes > to &gt; in the bong-ring=... value, so >=2 becomes &gt;=2:

<include>
    <X-PRE-PROCESS cmd="set" data="domain=test.domain"/>
    <X-PRE-PROCESS cmd="set" data="bong-ring=v=-7;%(100,0,941.0,1477.0);v=-7;&gt;=2;+=.1;%(1400,0,350,440)"/>
</include>

Isn't ">" protected by the surrounding quotes ""?

So the question is:

Is this a bug in XMLStarlet, or is it a bug in the application (FreeSWITCH v1.7) that uses vars.xml and parses
<X-PRE-PROCESS cmd="set" data="bong-ring=v=-7;%(100,0,941.0,1477.0);v=-7;&gt;=2;+=.1;%(1400,0,350,440)"/>
as
v=-7;%(100,0,941.0,1477.0);v=-7;&gt;=2;+=.1;%(1400,0,350,440)


Solution

  • There is nothing wrong with XMLStarlet doing this.

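    The notion that > is being "protected" by the quotes is wrong. Technically, a literal > is legal in attribute values (and in text nodes, except as part of the sequence ]]>), whereas a literal < is illegal in attribute values. Either way, &gt; is just another spelling of the same character, so the attribute value itself has not changed.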

    Usually tools escape the XML-reserved characters regardless of context(*), so text nodes will contain &gt; and attributes will contain &gt; as well. There is nothing wrong with this.

    In fact, every single character in an attribute value or text node value may be written as an escape.

    The following is completely legal XML that is 100% equivalent to your edited file, whether > is written literally or as &gt;:

    <include>
        <X-PRE-PROCESS cmd="&#x73;&#x65;&#x74;" data="&#x64;&#x6f;&#x6d;&#x61;&#x69;&#x6e;&#x3d;&#x74;&#x65;&#x73;&#x74;&#x2e;&#x64;&#x6f;&#x6d;&#x61;&#x69;&#x6e;"/>
        <X-PRE-PROCESS cmd="&#x73;&#x65;&#x74;" data="&#x62;&#x6f;&#x6e;&#x67;&#x2d;&#x72;&#x69;&#x6e;&#x67;&#x3d;&#x76;&#x3d;&#x2d;&#x37;&#x3b;&#x25;&#x28;&#x31;&#x30;&#x30;&#x2c;&#x30;&#x2c;&#x39;&#x34;&#x31;&#x2e;&#x30;&#x2c;&#x31;&#x34;&#x37;&#x37;&#x2e;&#x30;&#x29;&#x3b;&#x76;&#x3d;&#x2d;&#x37;&#x3b;&#x3e;&#x3d;&#x32;&#x3b;&#x2b;&#x3d;&#x2e;&#x31;&#x3b;&#x25;&#x28;&#x31;&#x34;&#x30;&#x30;&#x2c;&#x30;&#x2c;&#x33;&#x35;&#x30;&#x2c;&#x34;&#x34;&#x30;&#x29;"/>
    </include>
    

    It comes down to this: XML is not a string. Don't treat it as one. Don't use or create tools that treat XML as a string. XML requires a parser - and all conforming parsers will do the right thing in this situation.
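
    As a minimal sketch of that point, the snippet below parses three spellings of the second element from the question with Python's standard xml.etree.ElementTree and compares the resulting data attribute. Python is used here only for illustration; the names LITERAL, ENTITY, NUMERIC and data_attribute are made up for this example, and FreeSWITCH's own preprocessor is not involved.

```python
import xml.etree.ElementTree as ET

# Three spellings of the same element: literal '>', '&gt;' and '&#x3e;'.
LITERAL = '<X-PRE-PROCESS cmd="set" data="bong-ring=v=-7;%(100,0,941.0,1477.0);v=-7;>=2;+=.1;%(1400,0,350,440)"/>'
ENTITY = '<X-PRE-PROCESS cmd="set" data="bong-ring=v=-7;%(100,0,941.0,1477.0);v=-7;&gt;=2;+=.1;%(1400,0,350,440)"/>'
NUMERIC = '<X-PRE-PROCESS cmd="set" data="bong-ring=v=-7;%(100,0,941.0,1477.0);v=-7;&#x3e;=2;+=.1;%(1400,0,350,440)"/>'


def data_attribute(xml_text):
    """Parse the element and return the decoded value of its data attribute."""
    return ET.fromstring(xml_text).get("data")


# A set collapses equal strings, so one member means all spellings agree.
values = {data_attribute(doc) for doc in (LITERAL, ENTITY, NUMERIC)}
print(values)
```

    If the application reads the attribute through a real XML parser, it sees the literal >=2 in all three cases. Seeing &gt; in the value it works with would suggest the file is being handled as plain text somewhere along the way.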


    (*) From the point of view of an XML serializer: a) Generating different output for attribute values and text nodes makes the serialization process more complex without adding any value to the result. b) It's easier to write a single function for XML-escaping any string and then re-use it. c) Symmetry is easier to handle in general and programmers tend to like it.
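
    To illustrate points a) to c), here is a rough sketch of a serializer that reuses one escaping routine for both attribute values and text nodes. It relies on xml.sax.saxutils.escape from the Python standard library; xml_escape and render_element are hypothetical helpers, and this is not a description of how XMLStarlet or libxml2 is actually implemented.

```python
from xml.sax.saxutils import escape


def xml_escape(value):
    """Escape &, < and > (plus the double quote) the same way in every context."""
    return escape(value, {'"': "&quot;"})


def render_element(tag, data, text):
    """Serialize a tiny element, using the same escaper for attribute and text."""
    return f'<{tag} data="{xml_escape(data)}">{xml_escape(text)}</{tag}>'


print(render_element("X", "v=-7;>=2", "a > b & c"))
```

    With a single routine, > comes out as &gt; in the attribute even though a literal > would have been legal there, which is the behaviour observed in the question.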