xml, bash, hp-ux

Most efficient way to parse XML and extract data into a table


Some context on what I'm trying to achieve.

I'm currently on a locked-down HP-UX box with bash and perl at my disposal; however, I've got no experience with perl.

The input is a dump of hex and XML in the following format, repeated for each message (0 to n):

MQGET of message number 1

Message Descriptor
Various Config / Params
Various Config / Params
Various Config / Params

Message

length - 3631 bytes

00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<config  params>'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<config  params>'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<config  params>'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<config  params>'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<config  params>'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<config  params>'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<config  params>'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<config  params>'

00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<config  params>'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<soapenv:Envelop'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        'e xmlns:soapenv='
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '"http://schemas.'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<useful_xml_data'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<useful_xml_data'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<useful_xml_data'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<useful_xml_data'

00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<useful_xml_data'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<useful_xml_data'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<useful_xml_data'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<useful_xml_data'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        '<xml_data_closin'
00000000:   3453 5675 2346 2345 2346 8679 3452 7554        'g_tag>          '

I want to end up with the following output:

1 <useful_xml_data> <specific_value> <specific_xml>
2 <useful_xml_data> <specific_value> <specific_xml>
n <useful_xml_data> <specific_value> <specific_xml>

My approach at the moment is the following:

untouchable_script_sdout | sed -n "/^[0000]/p" | cut -c59-74 | tr -d '\n'

This strips everything except the XML text column and removes all newline characters.

I then pass it through an XML parsing script, similar to the one in this post, which inserts a \n whenever the entity is an XML closing tag.
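One detail about the pattern is worth noting: `[0000]` is a bracket expression, so it matches a single `0` at the start of the line rather than four zeros. A minimal sketch of a stricter filter follows, assuming every dump line starts with an eight-digit hex offset followed by a colon, as in the sample above. The function name `hex_lines` is made up for illustration.

```shell
# Keep only lines that begin with an 8-digit hex offset and a colon.
# Assumption: offsets are exactly eight hex digits, as in the sample dump.
hex_lines() {
  grep -E '^[0-9A-Fa-f]{8}:'
}
```

It would slot into the pipeline in place of the sed step, for example `untouchable_script_sdout | hex_lines | cut -c59-74 | tr -d '\n'`.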

This leaves me with the following:

<msg1_open_tag>
<xml_tag>value
</xmltag>
<xml_tag>value
</xmltag>
....
</close_tag>

<msgn_open_tag>
<xml_tag>value
</xmltag>
<xml_tag>value
</xmltag>
</close_tag>
....

This means I can extract the data I want using grep/awk; however, I'm struggling to align the data (some of the messages might have null values).

In my head, the next step would be to get the XML onto one line per message:

<msg1_open_tag>  <xml_tag>value  </xmltag>  <xml_tag>value  </xmltag>    </close_tag>
<msgn_open_tag>   <xml_tag>value   </xmltag>   <xml_tag>value   </xmltag>   </close_tag>

Then loop through these, processing and printing as required to get a table.

However, I'm struggling to get each message onto one line.

As you can no doubt tell, I'm far from a bash expert; I'm merely picking it up as I go.
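One possible way to get one message per line is to let awk do the joining and use the "MQGET of message number" header as the record boundary. This is only a sketch under a few assumptions: each hex-dump line starts with a hex offset and a colon, the text column is wrapped in single quotes as in the sample, and the function name `dump_to_lines` is made up for illustration. It keeps everything in the text column, including the config lines, just as the cut-based pipeline does.

```shell
# Hypothetical helper: reads the raw dump on stdin and prints
# "<message number> <joined text column>" as one line per message.
dump_to_lines() {
  awk '
    # Message header: flush the previous message, remember the new number.
    /^MQGET of message number/ {
      if (n != "") print n, buf
      n = $NF
      buf = ""
      next
    }
    # Hex-dump line: keep the text between the first and last single quote.
    /^[0-9A-Fa-f]+:/ {
      first = index($0, "\047")          # \047 is the single quote
      if (first == 0) next
      rest = substr($0, first + 1)
      last = 0
      for (i = length(rest); i > 0; i--) {
        if (substr(rest, i, 1) == "\047") { last = i; break }
      }
      if (last > 0) buf = buf substr(rest, 1, last - 1)
    }
    # Flush the final message.
    END { if (n != "") print n, buf }
  '
}
```

Usage would be along the lines of `untouchable_script_sdout | dump_to_lines`. Using the quotes rather than fixed column numbers avoids depending on the exact width of the hex columns, at the cost of assuming the quotes are always present.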

Any advice or best practice pointers would be greatly appreciated.


Solution

  • Unfortunately I couldn't get the suggested sed command to work.

    After a few hours of tinkering and much Google-Fu I came up with the following:

    #par_xml is a modified version from mikeserv's answer which was linked above
    #awk 'NR%4 !=0' is to remove a duplicate value (constant on every message)
    
    par_xml.sh app_xml.out | grep -E "UsefulXML1|UsefulXML2|UsefulXML3|UsefulXML4" | grep -v "</" | awk -F'>' '{print $2}' | awk 'NR%4 !=0' | sed 'N;N;N;s/\n/ /g'
    

    And yes, I'm aware how awful this solution is... but it gets me the desired output:

    useful_xml_data1 specific_value1 specific_xml1 useful_xml_data1
    useful_xml_data2 specific_value2 specific_xml2 useful_xml_data2
    useful_xml_datan specific_valuen specific_xmln useful_xml_datan