
In Perl, extract text from related nodes, using XML::Twig


The following is the XML file that I want to parse:

<?xml version="1.0" encoding="UTF-8"?>

<topic id="yerus5" xmlns:ditaarch="http://dita.oasis-open.org/architecture/2005/">



<title/>
  <shortdesc/>
  <body>
<p><b>CCU_CNT_ADDR: (Address=0x004 Reset=32'h1)</b><table id="table_r5b_1xj_ts">
    <tgroup cols="4">
      <colspec colnum="1" colname="col1"/>
      <colspec colnum="2" colname="col2"/>
      <colspec colnum="3" colname="col3"/>
      <colspec colnum="4" colname="col4"/>
      <tbody>
        <row>
          <entry>Field</entry>
          <entry>OFFSET</entry>
          <entry>R/W Access</entry>
          <entry>Description</entry>
        </row>
        <row>
          <entry>reg2sm_cnt</entry>
          <entry>15:0</entry>
          <entry>R/W</entry>
          <entry>Count Value to increment in the extenral memory at the specified location.
            Default Value of 1. A Count value of 0 will clear the counter value</entry>
        </row>
        <row>
          <entry>ccu2bus_endianess</entry>
          <entry>24</entry>
          <entry>R/W</entry>
          <entry>Endianess of the data structure bit</entry>
        </row></tbody>
    </tgroup>
  </table><b>CCU_STAT_ADDR: (Address=0x008 Reset=32'h0)</b><table id="table_mcc_1xj_ts">
    <tgroup cols="4">
      <colspec colnum="1" colname="col1"/>
      <colspec colnum="2" colname="col2"/>
      <colspec colnum="3" colname="col3"/>
      <colspec colnum="4" colname="col4"/>
      <tbody>
        <row>
          <entry>Field</entry>
          <entry>OFFSET</entry>
          <entry>R/W Access</entry>
          <entry>Description</entry>
        </row>
        <row>
          <entry>fifo_cnt</entry>
          <entry>1:0</entry>
          <entry>R</entry>
          <entry>Status. 0x0 indicates that the engine is free. Will be 0x1 on a write to
            address</entry>
        </row>
        <row>
          <entry>rfifo_cnt</entry>
          <entry>3:2</entry>
          <entry>R</entry>
          <entry>Status. 0x0 indicates there are no pending read values from CCU engine.</entry>
        </row> </tbody>
    </tgroup>
  </table></p>


</body>
</topic>

I ran the following code (taken from the question "In Perl, XML::Simple is not able to dereference multi dimensional associative array parsed by Data::Dumper"):

    use strict;
    use warnings;
    use XML::Twig;

    use Data::Dumper;

    my @headers;

    my $column_to_show = 'Field';

    sub process_row {
        my %entries;

        my ( $twig, $row ) = @_;
        my @row_entries = map { $_->text } $row->children;
        if (@headers) {
            @entries{@headers} = @row_entries;
            print $column_to_show, " => ", $entries{$column_to_show}, "\n";
        }
        else {
            @headers = @row_entries;
        }
    }

    my $twig = XML::Twig->new(
        'pretty_print' => 'indented_a',
        twig_handlers  => { 'row' => \&process_row }
    )->parsefile('your_file.xml');

With it, I am able to access the text of each <entry> element.

What I cannot do is group the rows under their <b> heading. I can extract the text of every <b> element, but I cannot extract the <row> elements that belong to each <b> separately. The following is a sample of the output I want:

Name: CCU_CNT_ADDR: (Address=0x004 Reset=32'h1)
Field: reg2sm_cnt 
OFFSET: 15:0 
Access: R/W 
Description: Count Value to increment in the extenral memory at the specified location. Default Value of 1. A Count value of 0 will clear the counter value 

Field: ccu2bus_endianess
OFFSET: 24 
Access: R/W 
Description: Endianess of the data structure bit 
 .
 .
 .
 .
 .
 .
 .
Name: CCU_STAT_ADDR: (Address=0x008 Reset=32'h0) 
Field: fifo_cnt 
.
 .
 .
 .
 .
 .
 .

I tried the following, but it does not work:

foreach my $b ( $twig->get_xpath("//b") )    # Extract text of <b></b>
{
    print $b->text, "\n";
    foreach my $row ( $twig->get_xpath("//row") )
    {
        print $row->text, "\n";
    }
}

Solution

  • Your example is slightly awkward to handle, because the XML doesn't explicitly associate each 'heading' with its 'table' (e.g. by wrapping both in a common parent element).

    However, what you can do is use the prev_sibling method to get the previous element at the same level.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use XML::Twig;
    
    my $twig = XML::Twig->new->parsefile ( 'your_file.xml' );
    
    foreach my $table ( $twig->get_xpath('//table') ) {
        my $header = $table->prev_sibling->text;
        print "Name: $header\n";
        my @headers;
        foreach my $row ( $table->get_xpath("tgroup/tbody/row") ) {
            my %entries;
            my @row_entries = map { $_->text =~ s/\n\s+//rg; } $row->children;
            if (@headers) {
                @entries{@headers} = @row_entries;
                foreach my $field (@headers) {
                    print "$field: $entries{$field}\n";
                }
            }
            else {
                @headers = @row_entries;
            }
        }
        print "----\n";
    }
    

    Note - this assumes that the element immediately before each <table> is its heading. That holds for your sample, but if a table has no preceding sibling, prev_sibling returns undef and the ->text call dies, and if some other element precedes the table, its text is printed as the name.

    • We run a 'foreach' loop, picking out the elements called table (of which there are two in your sample).
    • For each table, we assume that the previous sibling element is the header. In this case, that's your <b> elements. Be wary of that, though: <b> denotes bold in HTML, so it is a formatting tag rather than a structural one.
    • We then do basically the same thing as your original code: for each table, treat the first row as the column headers, map the entries of every later row onto those headers, and print them one per line.
    • As part of doing this, I use a regex to remove each line feed and the whitespace after it (s/\n\s+//rg), because the formatting of the description looked a bit 'off'. You can remove that if it's undesired. Because it substitutes nothing, the words on either side of a line break get joined (as in "location.Default" in the output); substituting a single space instead avoids that. (Note - the /r flag only works on Perl 5.14 and newer.)
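
    The whitespace cleanup can also be done without the /r flag, which matters on a Perl older than 5.14. A minimal sketch, assuming a hypothetical helper called normalize_space (the name is not part of the script above), which collapses every run of whitespace into one space so that words on either side of a line break stay separated:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: collapse every run
# of whitespace (including line breaks) into a single space, then trim the
# ends. It modifies a copy, so no /r flag is needed.
sub normalize_space {
    my ($text) = @_;
    $text =~ s/\s+/ /g;     # any run of whitespace becomes one space
    $text =~ s/^ | $//g;    # drop the single space left at either end
    return $text;
}

# Sample text copied from one of the <entry> elements in the question.
my $raw = "Will be 0x1 on a write to\n            address";
print normalize_space($raw), "\n";    # prints: Will be 0x1 on a write to address
```

    The script above does not use this helper, so the output shown next still has the words joined at the line breaks.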

    Run against your sample file, the script produces:

    Name: CCU_CNT_ADDR: (Address=0x004 Reset=32'h1)
    Field: reg2sm_cnt
    OFFSET: 15:0
    R/W Access: R/W
    Description: Count Value to increment in the extenral memory at the specified location.Default Value of 1. A Count value of 0 will clear the counter value
    Field: ccu2bus_endianess
    OFFSET: 24
    R/W Access: R/W
    Description: Endianess of the data structure bit
    ----
    Name: CCU_STAT_ADDR: (Address=0x008 Reset=32'h0)
    Field: fifo_cnt
    OFFSET: 1:0
    R/W Access: R
    Description: Status. 0x0 indicates that the engine is free. Will be 0x1 on a write toaddress
    Field: rfifo_cnt
    OFFSET: 3:2
    R/W Access: R
    Description: Status. 0x0 indicates there are no pending read values from CCU engine.
    ----
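
    If some tables might not have a <b> heading directly in front of them, the lookup can be made more defensive. This is only a sketch under assumptions: heading_for is a hypothetical helper name, the inline XML is a cut-down stand-in for your file, and '(no heading)' is an arbitrary placeholder. It checks that the immediate previous sibling exists and is a <b> element before reading its text:

```perl
use strict;
use warnings;
use XML::Twig;

# Cut-down sample (an assumption for this sketch): the second table has
# no <b> element directly before it.
my $xml = <<'XML';
<topic><body><p><b>REG_A</b><table id="t1"/><table id="t2"/></p></body></topic>
XML

# Hypothetical helper: return the heading text for a table, or undef when
# the immediate previous sibling is missing or is not a <b> element.
sub heading_for {
    my ($table) = @_;
    my $prev = $table->prev_sibling;    # undef if the table is the first child
    return $prev->text if $prev && $prev->tag eq 'b';
    return;
}

my $twig = XML::Twig->new;
$twig->parse($xml);

my %name_for;
foreach my $table ( $twig->get_xpath('//table') ) {
    my $heading = heading_for($table);
    $name_for{ $table->att('id') } = defined $heading ? $heading : '(no heading)';
}

print "$_: $name_for{$_}\n" for sort keys %name_for;
```

    Here table t1 should get the name REG_A and table t2 the placeholder, instead of the script dying or printing unrelated text as the name.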