I am currently fighting with awk to extract multiple strings from XML-based files and export those strings in CSV format.
Below is a snippet of the tags I am trying to get:
<GroupInfo Description="" Name="Site 2" Path="My Company\Site 2"/>
...
...
<PrivateInsightServerList EnableEntireServerList="1" >
<PrivateInsightServer Address="douda" LegacyClientSupport="1" Port="80" Protocol="HTTP"/>
</PrivateInsightServerList>
<PrivateInsightServerList EnableEntireServerList="1" >
<PrivateInsightServer Address="douda2" LegacyClientSupport="0" Port="443" Protocol="HTTPS"/>
</PrivateInsightServerList>
I do not know how to parse the file, given that the number of servers in the XML can vary from 0 to N, although they always have the same structure.
Ideally, I am looking for the following in a CSV, with the N servers from the same XML file added to the same line, like this:
path,address,port,protocol
e.g. from the snippet:
My company\site 2,douda,80,HTTP,douda2,443,HTTPS
Since a root element is required and you haven't provided one, I have assumed your root XML element is simply <root>.
The XML isn't nicely nested (I would have expected PrivateInsightServerList to be a child of GroupInfo), so we will need a little trickery. No matter.
First, with xmlstarlet (the binary may be installed as xmlstarlet rather than xml on your system):
xml sel -t -m '/root/GroupInfo' --var groupinfo=@Path \
-m '/root/PrivateInsightServerList[@EnableEntireServerList=1]' \
-v '$groupinfo' -o "," \
-v PrivateInsightServer/@Address -o "," \
-v PrivateInsightServer/@Port -o "," \
-v PrivateInsightServer/@Protocol -nl \
input.xml
-m '/root/GroupInfo' --var groupinfo=@Path
stores the Path attribute in a variable for later use.
-m '/root/PrivateInsightServerList[@EnableEntireServerList=1]'
limits the nodes selected, in case EnableEntireServerList is not 1.
-v ... -o ","
outputs the values we want, separated by commas and followed by a newline (-nl).
(Instead of a variable to cache Path you could also use a "sibling" XPath like -v //GroupInfo/@Path, but that may not work reliably; as I said, the XML doesn't appear "nice" to me.)
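For readers without xmlstarlet, the same selection logic can be sketched in Python with the standard library's xml.etree.ElementTree. This is only an illustration, not a replacement for the command above: the function name rows, the SAMPLE text, and the <root> wrapper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Sample input; the <root> wrapper is the same assumption as in the answer.
SAMPLE = """<root>
<GroupInfo Description="" Name="Site 2" Path="My Company\\Site 2"/>
<PrivateInsightServerList EnableEntireServerList="1">
<PrivateInsightServer Address="douda" LegacyClientSupport="1" Port="80" Protocol="HTTP"/>
</PrivateInsightServerList>
<PrivateInsightServerList EnableEntireServerList="1">
<PrivateInsightServer Address="douda2" LegacyClientSupport="0" Port="443" Protocol="HTTPS"/>
</PrivateInsightServerList>
</root>"""


def rows(xml_text):
    """Return one 'path,address,port,protocol' string per enabled server."""
    root = ET.fromstring(xml_text)
    out = []
    # Outer loop: cache the Path of each GroupInfo (the --var step).
    for group in root.findall("GroupInfo"):
        path = group.get("Path", "")
        # Inner loop: only lists whose EnableEntireServerList attribute is "1".
        for lst in root.findall("PrivateInsightServerList[@EnableEntireServerList='1']"):
            for srv in lst.findall("PrivateInsightServer"):
                out.append(",".join([
                    path,
                    srv.get("Address", ""),
                    srv.get("Port", ""),
                    srv.get("Protocol", ""),
                ]))
    return out


if __name__ == "__main__":
    for line in rows(SAMPLE):
        print(line)
```

Like the xmlstarlet command, this pairs every GroupInfo with every enabled server list, so it is only sensible when the file holds a single GroupInfo.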
Since you've also tagged this with awk, I'll assume you have a recent gawk and gawkextlib with the XML module (not that you can't do it in plain awk, but it's not very productive if learning about XML parsing isn't the task at hand).
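@load "xml"

# Remember the Path of the most recent GroupInfo element.
XMLSTARTELEM && XMLPATH ~ /GroupInfo$/ { mypath = XMLATTR["Path"] }

# Remember whether the current server list is enabled.
XMLSTARTELEM && XMLPATH ~ /PrivateInsightServerList$/ { ok = XMLATTR["EnableEntireServerList"] }

# Print one CSV row per server found inside an enabled list.
XMLSTARTELEM && XMLPATH ~ /PrivateInsightServerList[/]PrivateInsightServer/ {
    if (ok == 1)
        printf("%s,%s,%s,%s\n",
               mypath, XMLATTR["Address"], XMLATTR["Port"], XMLATTR["Protocol"])
}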
This is a little primitive (I've not yet used the DOM xmltree module). There are three blocks above; each triggers on XMLSTARTELEM and inspects XMLPATH, which contains the full path of the element. The first two blocks cache Path and EnableEntireServerList; the final one prints out the CSV as required.
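The same event-driven pattern (react to each start tag, cache state, print when a server element appears) can be sketched in Python with ElementTree's iterparse. This is an illustration under assumptions, not the gawk API: stream_rows, the SAMPLE data, the <root> wrapper and the disabled "skipme" list are all made up for the example, and the tag stack stands in for XMLPATH.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: the <root> wrapper and the disabled "skipme" list
# are assumptions added for illustration.
SAMPLE = b"""<root>
<GroupInfo Description="" Name="Site 2" Path="My Company\\Site 2"/>
<PrivateInsightServerList EnableEntireServerList="1">
<PrivateInsightServer Address="douda" LegacyClientSupport="1" Port="80" Protocol="HTTP"/>
</PrivateInsightServerList>
<PrivateInsightServerList EnableEntireServerList="0">
<PrivateInsightServer Address="skipme" LegacyClientSupport="0" Port="8080" Protocol="HTTP"/>
</PrivateInsightServerList>
<PrivateInsightServerList EnableEntireServerList="1">
<PrivateInsightServer Address="douda2" LegacyClientSupport="0" Port="443" Protocol="HTTPS"/>
</PrivateInsightServerList>
</root>"""


def stream_rows(source):
    """Walk start/end events and collect one CSV row per enabled server."""
    stack = []      # open element names, playing the role of XMLPATH
    mypath = ""     # cached GroupInfo Path
    ok = False      # cached EnableEntireServerList flag
    out = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "end":
            stack.pop()
            continue
        stack.append(elem.tag)
        if elem.tag == "GroupInfo":
            mypath = elem.get("Path", "")
        elif elem.tag == "PrivateInsightServerList":
            ok = elem.get("EnableEntireServerList") == "1"
        elif stack[-2:] == ["PrivateInsightServerList", "PrivateInsightServer"] and ok:
            out.append(",".join([
                mypath,
                elem.get("Address", ""),
                elem.get("Port", ""),
                elem.get("Protocol", ""),
            ]))
    return out


if __name__ == "__main__":
    for line in stream_rows(io.BytesIO(SAMPLE)):
        print(line)
```

As with the gawk script, the result depends on document order: a server is reported against whichever GroupInfo was seen most recently.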
Run with gawk -f parse.awk input.xml (by "recent" I mean gawk-4.1 or later).
I would expect either method could have problems depending on the XML schema and ordering of data.
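Both methods above print one line per server, whereas the question asks for all N servers of a file on the same line. A minimal sketch of that grouping in Python follows, assuming each GroupInfo appears in the document before the server lists that belong to it; one_line_per_group, SAMPLE and the <root> wrapper are hypothetical names for the example, and no CSV quoting is applied.

```python
import xml.etree.ElementTree as ET

# Sample input; the <root> wrapper is an assumption, as in the answer.
SAMPLE = """<root>
<GroupInfo Description="" Name="Site 2" Path="My Company\\Site 2"/>
<PrivateInsightServerList EnableEntireServerList="1">
<PrivateInsightServer Address="douda" LegacyClientSupport="1" Port="80" Protocol="HTTP"/>
</PrivateInsightServerList>
<PrivateInsightServerList EnableEntireServerList="1">
<PrivateInsightServer Address="douda2" LegacyClientSupport="0" Port="443" Protocol="HTTPS"/>
</PrivateInsightServerList>
</root>"""


def one_line_per_group(xml_text):
    """Return one CSV line per GroupInfo, with all its enabled servers appended."""
    root = ET.fromstring(xml_text)
    lines = []       # one list of fields per GroupInfo
    current = None   # fields of the GroupInfo seen most recently
    enabled = False  # flag of the server list seen most recently
    # iter() visits elements in document order, whatever the nesting.
    for elem in root.iter():
        if elem.tag == "GroupInfo":
            current = [elem.get("Path", "")]
            lines.append(current)
        elif elem.tag == "PrivateInsightServerList":
            enabled = elem.get("EnableEntireServerList") == "1"
        elif elem.tag == "PrivateInsightServer" and enabled and current is not None:
            current.extend([
                elem.get("Address", ""),
                elem.get("Port", ""),
                elem.get("Protocol", ""),
            ])
    return [",".join(fields) for fields in lines]


if __name__ == "__main__":
    for line in one_line_per_group(SAMPLE):
        print(line)
```

A GroupInfo with no servers yields a line containing only its path, which covers the 0-server case from the question.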