linux, sed, command-line, pattern-matching

Sed command to replace " (double quote) with &quot; and ' (single quote) with &apos; in all XML tag values in a file


<?xml version="1.0" encoding="UTF-8"?>
<Document>
    <InnerDoc>
        <GrpHdr>
            <MsgId>aaa.xml</MsgId>
            <CreDtTm>2023-08-15T13:35:33.0Z</CreDtTm>
            <MsgRcpt>
                    <Id  value="111">
                    <OrgId>
                        <Othr>
                            <Id>asa-"-as'#</Id>
                        </Othr>
                    </OrgId>
                </Id>
            </MsgRcpt>
            <tag1 info = "AddInf1">Report Map = PRIOR DAY BALTRAN INCREMENTAL " - '</tag1>
            <tag2 info = "AddInf2">Report Map =  " - '</tag1>
        </GrpHdr>
    </InnerDoc>
</Document>

For the above XML, I need to replace every " (double quote) with &quot; and every ' (single quote) with &apos;, e.g.: <tag1 info = "AddInf1">Report Map = PRIOR DAY BALTRAN INCREMENTAL &quot; - &apos;</tag1> The replacement should apply only to the text in XML tag values, so it should match only text between > and <. Could you please suggest the correct sed command for this?

I tried a sed command for the replacement, but it replaces every quote in the file. I need to match the pattern and replace only within the text between > and <.


Solution

  • Using GNU awk for multi-char RS and RT:

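    $ awk -v RS='>[^<]+<' -v ORS= '{
        # RT holds the text matched by RS, i.e. ">text<"; $0 is everything before it
        gsub(/"/,"\\&quot;",RT)       # "\\&" yields a literal & in the replacement
        gsub(/\047/,"\\&apos;",RT)    # \047 is the octal code for a single quote
        print $0 RT
    }' file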
    <?xml version="1.0" encoding="UTF-8"?>
    <Document>
        <InnerDoc>
            <GrpHdr>
                <MsgId>aaa.xml</MsgId>
                <CreDtTm>2023-08-15T13:35:33.0Z</CreDtTm>
                <MsgRcpt>
                        <Id  value="111">
                        <OrgId>
                            <Othr>
                                <Id>asa-&quot;-as&apos;#</Id>
                            </Othr>
                        </OrgId>
                    </Id>
                </MsgRcpt>
                <tag1 info = "AddInf1">Report Map = PRIOR DAY BALTRAN INCREMENTAL &quot; - &apos;</tag1>
                <tag2 info = "AddInf2">Report Map =  &quot; - &apos;</tag1>
            </GrpHdr>
        </InnerDoc>
    </Document>
    

    It's obviously fragile, as a literal > or < might appear in the text (for example in a comment or CDATA section) or within tag attribute values.
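
  • Using sed, since that is what the question asked for. This is only a sketch under the same assumption as the awk version (> and < appear only as tag delimiters), plus one more: the text to fix starts on the same line as the > that precedes it. The function name escape_text_quotes is made up for illustration. Each substitution escapes the first quote found after a > with no < in between, and the t command repeats it until nothing is left to replace:

```shell
# Hypothetical helper: escape " and ' only in text that follows a '>' on the same line.
escape_text_quotes() {
    # :dq loop - replace the first " that follows a > with no < or " in between, repeat.
    # :sq loop - the same for ' (that expression is double-quoted so it can contain ').
    sed -E \
        -e ':dq' -e 's/(>[^<"]*)"/\1\&quot;/' -e 't dq' \
        -e ':sq' -e "s/(>[^<']*)'/\1\&apos;/" -e 't sq' \
        "$@"
}
```

    Call it as escape_text_quotes file, or pipe input into it. Quotes in attributes are left alone because they sit before the tag's closing >. Because sed processes one line at a time here, quotes in text that continues onto a later line are not touched.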