
grep command to find a quote (") or apostrophe (') in the XML tag values of a file


<?xml version="1.0" encoding="UTF-8"?>
 <Document>
    <InnerDoc>
        <GrpHdr>
            <MsgId>aaa.xml</MsgId>
            <CreDtTm>2023-08-15T13:35:33.0Z</CreDtTm>
            <MsgRcpt>
                    <Id  value="111">
                    <OrgId>
                        <Othr>
                            <Id>asa-"-as'#</Id>
                        </Othr>
                    </OrgId>
                </Id>
            </MsgRcpt>
            <tag1 info = "AddInf1">Report Map = PRIOR DAY BALTRAN INCREMENTAL " - '</tag1>
            <tag2 info = "AddInf2">Report Map =  " - '</tag1>
        </GrpHdr>
    </InnerDoc>
</Document>

In the above XML I need to find whether there is at least one occurrence of either quote (") or apostrophe (') in the XML tag value only.

For example, in

<tag1 info = "AddInf1">Report Map = PRIOR DAY BALTRAN INCREMENTAL " - '</tag1>

grep should evaluate the string between > and < only.

I tried a simple search for the special characters, but it also matches double quotes outside the tag values, such as version="1.0" in the XML declaration. I want to avoid those matches.


Solution

  • Joachim Sauer's comment is correct - for example, even the simplest invocation on your test input yields this:

    $: xmllint file
    file:17: parser error : Opening and ending tag mismatch: tag2 line 17 and tag1
                <tag2 info = "AddInf2">Report Map =  " - '</tag1>
                                                                 ^
    

    A real XML parser such as xmllint will also make it easier to process escape codes.

    With his much-appreciated assist:

    $: xmllint --xpath "//text()[contains(.,'\"') or contains(., \"'\")]" file
    asa-"-as'#Double only: " Single only: ' 
    

    I'm still trying to figure out a way to get each match on its own line, and maybe line numbers.
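
One possible workaround for the missing newlines, offered as a sketch rather than a tested recipe: count the matching text nodes, then print them one at a time with string(), which xmllint terminates with a newline. The function name quoted_text_nodes is hypothetical. It assumes an xmllint build that supports --xpath and a well-formed input file (the sample needs its mismatched </tag1> fixed first). It does not give line numbers.

```shell
# Hypothetical helper: print each text node containing ' or " on its own line.
# Assumes xmllint with --xpath support and a well-formed file passed as $1.
quoted_text_nodes() {
  # XPath: every text node that contains a double or a single quote
  xp="//text()[contains(., '\"') or contains(., \"'\")]"
  # count() prints a plain number; bail out if the file does not parse
  n=$(xmllint --xpath "count($xp)" "$1") || return 1
  i=1
  while [ "$i" -le "$n" ]; do
    # string() of the i-th match is printed followed by a newline
    xmllint --xpath "string(($xp)[$i])" "$1"
    i=$((i + 1))
  done
}
```

Call it as quoted_text_nodes file. The file is parsed once per match, which is fine for small documents but slow for large ones.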

    That said, what you really want is to find records with single or double quotes in that value space.

    $: grep -n $'>[^<]*[\'"][^<]*<' file
    11:                            <Id>asa-"-as'#</Id>
    16:            <tag1 info = "AddInf1">Report Map = PRIOR DAY BALTRAN INCREMENTAL " - '</tag1>
    17:            <tag2 info = "AddInf2">Report Map =  " - '</tag1>
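
Since the question only asks whether at least one such value exists, the same regex can be used with grep -q and the exit status instead of reading the output. This is a minimal sketch: has_quoted_value and the two temporary sample files are made up for illustration, and the pattern is written as one double-quoted string so it does not depend on $'...'.

```shell
# Hypothetical helper: succeeds if any line has ' or " between a ">" and the next "<".
has_quoted_value() {
  grep -q ">[^<]*['\"][^<]*<" "$1"
}

# Two throwaway sample files: quotes only in attributes vs. quotes in a value.
clean=$(mktemp)
dirty=$(mktemp)
printf '%s\n' '<?xml version="1.0"?>' \
  '<Doc><Id value="111">plain</Id></Doc>' > "$clean"
printf '%s\n' '<?xml version="1.0"?>' \
  "<Doc><Id value=\"111\">asa-\"-as'#</Id></Doc>" > "$dirty"

for f in "$clean" "$dirty"; do
  if has_quoted_value "$f"; then
    echo "quote found in a tag value"
  else
    echo "no quotes in tag values"
  fi
done

rm -f "$clean" "$dirty"
```

The first file reports no quotes because the declaration and attribute quotes are never followed by a < on the same side of a >; the second reports a match.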
    

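    This is going to break if a literal < or > appears anywhere other than as a tag delimiter, for example a > inside a quoted attribute value (legal XML, if unusual) or a < inside a CDATA section. It also only sees literal quote characters, so a quote written as an entity (&quot; or &apos;) is missed, and because grep works line by line, so is a value that spans several lines.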
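
To make the limits of the grep approach concrete, here is a small sketch that feeds a few hand-made lines to the same regex. The check function and the sample lines are invented for this illustration and are not from the original file.

```shell
pattern=">[^<]*['\"][^<]*<"

# Hypothetical helper: print "match" or "no match" for the given input lines.
check() {
  if printf '%s\n' "$@" | grep -q "$pattern"; then
    echo "match"
  else
    echo "no match"
  fi
}

# False positive: the ">" inside the attribute starts a match that then
# picks up the attribute's closing quote; the value "plain" has no quotes.
check '<a note="x > y">plain</a>'

# False negative: the quotes are written as entities, not literal characters.
check '<b>say &quot;hi&quot;</b>'

# False negative: the value spans two lines, and grep tests each line alone.
check '<c>first line' 'with a " quote</c>'
```

This prints match, no match, no match. An XML-aware tool avoids all three cases.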

    Note the $'...' construct is a Bash-ism. If that's unavailable you may need a more complicated bit of creative cross-quoting to get both correctly.

    $: grep -n '>[^<]*["'"'][^<]*<" file
    11:                            <Id>asa-"-as'#</Id>
    16:            <tag1 info = "AddInf1">Double only: " </tag1>
    17:            <tag2 info = "AddInf2">Single only: ' </tag1>
    
    $: grep -n '>[^<]*['"'"'"][^<]*<' file
    11:                            <Id>asa-"-as'#</Id>
    16:            <tag1 info = "AddInf1">Double only: " </tag1>
    17:            <tag2 info = "AddInf2">Single only: ' </tag1>
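
Another portable spelling, shown here as a sketch: put the whole regex in one double-quoted string and backslash-escape only the inner double quote. The regex contains no $, backtick, or backslash of its own, so nothing else in it is special to the shell. The variable names are arbitrary.

```shell
# One double-quoted string; only the inner double quote needs a backslash.
pattern=">[^<]*['\"][^<]*<"

# The cross-quoted spelling from above, for comparison.
cross='>[^<]*['"'"'"][^<]*<'

# Both expand to exactly the same regex text.
if [ "$pattern" = "$cross" ]; then
  echo "same regex: $pattern"
fi
```

It can then be used as grep -n "$pattern" file, or inline as grep -n ">[^<]*['\"][^<]*<" file.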