xml awk data-extraction xmlstarlet

Extract N strings from an XML-based file


I am currently fighting with awk to extract multiple strings from XML-based files and export them in CSV format.

Below is a snippet of the tags I am trying to get:

    <GroupInfo Description="" Name="Site 2" Path="My Company\Site 2"/>
    ...
    ...
    <PrivateInsightServerList EnableEntireServerList="1" >
      <PrivateInsightServer Address="douda" LegacyClientSupport="1" Port="80" Protocol="HTTP"/>
    </PrivateInsightServerList>
    <PrivateInsightServerList EnableEntireServerList="1" >
      <PrivateInsightServer Address="douda2" LegacyClientSupport="0" Port="443" Protocol="HTTPS"/>
    </PrivateInsightServerList>

I do not know how to parse the file, given that the number of servers in the XML can vary from 0 to N, although they always have the same structure.

Ideally, I would like the following in a CSV, with all N servers from the same XML file appended to the same line, like this:

path,address,port,protocol

e.g. from the snippet:

My Company\Site 2,douda,80,HTTP,douda2,443,HTTPS

Solution

  • Since it's required, and you haven't provided it, I have assumed your root XML element is simply "<root>".

    The XML isn't nicely nested (I would have expected PrivateInsightServerList to be a child of GroupInfo), so we will need a little trickery. No matter.

    First, with xmlstarlet (invoked as xml here; on some systems the binary is installed as xmlstarlet):

    xml sel -t -m '/root/GroupInfo' --var groupinfo=@Path \
      -m '/root/PrivateInsightServerList[@EnableEntireServerList=1]' \
        -v '$groupinfo' -o "," \
        -v PrivateInsightServer/@Address -o "," \
        -v PrivateInsightServer/@Port -o "," \
        -v PrivateInsightServer/@Protocol -nl \
      input.xml
    
    • -m '/root/GroupInfo' --var groupinfo=@Path stores the Path attribute in a variable for later use
    • -m '/root/PrivateInsightServerList[@EnableEntireServerList=1]' skips any server list whose EnableEntireServerList is not 1
    • each -v ... -o "," outputs one of the values we want followed by a comma; the last -v is followed by a newline (-nl) instead, so each server is printed on its own row

    (Instead of a variable to cache Path, you could also use a "sibling" XPath such as -v //GroupInfo/@Path, but that may not work reliably. As I said, the XML doesn't appear "nice" to me.)
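
    The command above prints one row per server, with the path repeated on every row, whereas the question asks for all servers of a file on a single line. A minimal sketch of a post-processing filter in plain awk follows. It assumes that no field contains a comma, and join_servers is a hypothetical helper name, not part of xmlstarlet:

```shell
# join_servers (hypothetical helper): read "path,address,port,protocol" rows on
# stdin and print one line per path with every server appended to it.
# Assumes no field contains a comma.
join_servers() {
  awk -F, '
    # First time a path is seen: remember its position and start its line.
    !($1 in row) { order[++n] = $1; row[$1] = $1 }
    # Append this server (address,port,protocol) to the line for its path.
    { row[$1] = row[$1] "," $2 "," $3 "," $4 }
    # Print the merged lines in first-seen order.
    END { for (i = 1; i <= n; i++) print row[order[i]] }
  '
}
```

    It would be used by piping the rows into it, for example by appending | join_servers to the xmlstarlet command.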

    Since you've also tagged this with awk, I'll assume you have a recent gawk and gawkextlib with the XML module (not that you can't do it in plain awk, but it's not very productive if learning about XML parsing isn't the task at hand).

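    @load "xml"

    # Remember the Path attribute of the most recent GroupInfo element.
    XMLSTARTELEM &&
      XMLPATH~/GroupInfo$/ { mypath=XMLATTR["Path"] }
    # Remember whether the enclosing server list is enabled.
    XMLSTARTELEM &&
      XMLPATH~/PrivateInsightServerList$/ { ok=XMLATTR["EnableEntireServerList"] }
    # Print one CSV row per server found in an enabled list.
    XMLSTARTELEM &&
      XMLPATH~/PrivateInsightServerList[/]PrivateInsightServer/ {
        if(ok==1) printf("%s,%s,%s,%s\n",
                  mypath,XMLATTR["Address"],XMLATTR["Port"],XMLATTR["Protocol"])
    }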
    

    This is a little primitive (I've not yet used the DOM xmltree module). There are three blocks above; each triggers on XMLSTARTELEM and inspects XMLPATH, which contains the full path of the current element. The first two blocks cache Path and EnableEntireServerList, and the final one prints the CSV, one row per server.

    Run with gawk -f parse.awk input.xml (by "recent" I mean gawk-4.1 or later)

    I would expect that either method could run into problems, depending on the XML schema and the ordering of the data.
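
    For comparison, here is a minimal sketch of the plain-awk route mentioned earlier, writing all servers on one line as the question asks. It is not a real XML parser: it assumes each element sits on its own line, attribute values are double-quoted and contain no commas, and the file holds a single GroupInfo. xml_to_csv is a hypothetical name.

```shell
# xml_to_csv (hypothetical helper): print "path,address,port,protocol,..." for one file.
# Assumes one element per line and double-quoted attribute values; not a real XML parser.
xml_to_csv() {
  awk '
    # Return the value of attribute "name" found in tag text s, or "" if absent.
    function attr(s, name,    re) {
      re = "[ \t]" name "=\"[^\"]*\""
      if (!match(s, re)) return ""
      s = substr(s, RSTART, RLENGTH)   # the matched name="value" text
      sub(/^[^"]*"/, "", s)            # drop everything up to the opening quote
      sub(/"$/, "", s)                 # drop the closing quote
      return s
    }
    # Start the output line with the group path.
    /<GroupInfo[ \t]/ { line = attr($0, "Path") }
    # Track whether the current server list is enabled.
    /<PrivateInsightServerList[ \t>]/ { ok = (attr($0, "EnableEntireServerList") == "1") }
    # Append each server of an enabled list to the same line.
    ok && /<PrivateInsightServer[ \t]/ {
      line = line "," attr($0, "Address") "," attr($0, "Port") "," attr($0, "Protocol")
    }
    END { if (line != "") print line }
  ' "$1"
}
```

    Called as xml_to_csv input.xml, it prints only the path when the file has no enabled servers. If the real files spread an element over several lines, this sketch will miss attributes, and the xmlstarlet or gawk approach is the safer choice.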