<?xml version="1.0" encoding="UTF-8"?>
<Document>
<InnerDoc>
<GrpHdr>
<MsgId>aaa.xml</MsgId>
<CreDtTm>2023-08-15T13:35:33.0Z</CreDtTm>
<MsgRcpt>
<Id value="111">
<OrgId>
<Othr>
<Id>asa-"-as'#</Id>
</Othr>
</OrgId>
</Id>
</MsgRcpt>
<tag1 info = "AddInf1">Report Map = PRIOR DAY BALTRAN INCREMENTAL " - '</tag1>
<tag2 info = "AddInf2">Report Map = " - '</tag1>
</GrpHdr>
</InnerDoc>
</Document>
In the above XML I need to find whether there is at least one occurrence of either a quote (") or an apostrophe (') in the XML tag value only.
For example, in
<tag1 info = "AddInf1">Report Map = PRIOR DAY BALTRAN INCREMENTAL " - '</tag1>
grep should evaluate the string between > and < only.
I tried a simple special-character search, but it also matches the double quotes outside tag values, such as version="1.0" in the header. I don't need those matches and want to avoid them.
Joachim Sauer's comment is correct - for example, even the simplest invocation on your test input yields this:
$: xmllint file
file:17: parser error : Opening and ending tag mismatch: tag2 line 17 and tag1
<tag2 info = "AddInf2">Report Map = " - '</tag1>
^
A proper XML parser will also make it easier to process escape codes.
With his much-appreciated assist:
$: xmllint --xpath "//text()[contains(.,'\"') or contains(., \"'\")]" file
asa-"-as'#Double only: " Single only: '
I'm still trying to figure out a way to get the newlines and maybe line numbers.
That said, what you really want is to find records with single or double quotes in that value space.
$: grep -n $'>[^<]*[\'"][^<]*<' file
11: <Id>asa-"-as'#</Id>
16: <tag1 info = "AddInf1">Report Map = PRIOR DAY BALTRAN INCREMENTAL " - '</tag1>
17: <tag2 info = "AddInf2">Report Map = " - '</tag1>
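Since the question only asks whether at least one such value exists, the exit status of grep -q may be all you need. A minimal sketch, assuming a made-up four-line sample in place of your file (the names sample, pattern and result are only for illustration):

```sh
# Hypothetical sample standing in for the real file.
sample=$(cat <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<MsgId>aaa.xml</MsgId>
<tag1 info = "AddInf1">Report Map = " - '</tag1>
<CreDtTm>2023-08-15T13:35:33.0Z</CreDtTm>
EOF
)

# Same regex as above, >[^<]*['"][^<]*<, written with plain sh quoting.
pattern='>[^<]*['"'"'"][^<]*<'

# -q prints nothing; the exit status is 0 when at least one line matches.
if printf '%s\n' "$sample" | grep -q "$pattern"; then
  result=found
else
  result=none
fi
echo "$result"
```

The header line does not trigger a match here, because nothing on it has a > followed later by a <.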
This is going to break if the tag-delimiting characters (< and >) are embedded in the value space, such as <, or in a quoted string (which is questionable XML, anyway).
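To make that failure mode concrete, here is a small sketch with a made-up line in which a > sits inside a quoted attribute value. The element text itself has no quotes, yet the pattern still matches, because the regex cannot tell the > in the attribute from the end of the tag:

```sh
# Same regex, >[^<]*['"][^<]*<, written with plain sh quoting.
pattern='>[^<]*['"'"'"][^<]*<'

# Hypothetical input: the attribute value holds a >, the element text is just "plain".
line='<tag info="a>b">plain</tag>'

# The match starts at the > inside the attribute and picks up the closing
# attribute quote, so this is a false positive.
if printf '%s\n' "$line" | grep -q "$pattern"; then
  verdict=matched
else
  verdict=clean
fi
echo "$verdict"
```

If input like this is possible, a line-based regex is probably the wrong tool and a parser-based approach is safer.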
Note that the $'...' construct is a Bash-ism. If it's unavailable, you may need a more complicated bit of creative cross-quoting to get both quote characters in correctly.
$: grep -n '>[^<]*["'"'][^<]*<" file
11: <Id>asa-"-as'#</Id>
16: <tag1 info = "AddInf1">Double only: " </tag1>
17: <tag2 info = "AddInf2">Single only: ' </tag1>
$: grep -n '>[^<]*['"'"'"][^<]*<' file
11: <Id>asa-"-as'#</Id>
16: <tag1 info = "AddInf1">Double only: " </tag1>
17: <tag2 info = "AddInf2">Single only: ' </tag1>
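If the cross-quoting is hard to read, one alternative is to put each quote character in its own variable and build the pattern from those. This is only a sketch under assumed sample data; sq, dq, pattern, sample and count are names made up for the example:

```sh
# Each quote character lives in its own variable, so the pattern itself
# needs no nested quoting.
sq="'"
dq='"'
pattern=">[^<]*[${sq}${dq}][^<]*<"

# Hypothetical sample standing in for the real file.
sample=$(cat <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<MsgId>aaa.xml</MsgId>
<Id>asa-"-as'#</Id>
<tag1 info = "AddInf1">Double only: " </tag1>
<tag2 info = "AddInf2">Single only: ' </tag2>
EOF
)

# -c counts matching lines; only the last three sample lines match.
count=$(printf '%s\n' "$sample" | grep -c "$pattern")
echo "$count"
```

This should work in any POSIX sh, not just Bash, since it relies only on ordinary single and double quoting.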