Using Perl LibXML to read textContent that contains html tags


If I have the following XML:

<File id="MyTestApp/app/src/main/res/values/strings.xml">
    <Identifier id="page_title" isArray="0" isPlural="0">
        <EngTranslation eng_indx="0" goesWith="-1" index="0">My First App</EngTranslation>
        <Description index="0">Home page title</Description>
        <LangTranslation index="0">My First App</LangTranslation>
    </Identifier>
    <Identifier id="count" isArray="0" isPlural="0">
        <EngTranslation eng_indx="0" goesWith="-1" index="0">You have <b>%1$d</b> view(s)</EngTranslation>
        <Description index="0">Number of page views</Description>
        <LangTranslation index="0">You have <b>%1$d</b> view(s)</LangTranslation>
    </Identifier>     
</File>

I'm trying to read the 'EngTranslation' text value, and want to return the full value including any HTML tags. For example, I have the following:

use strict;
use warnings;
use XML::LibXML;
use Encode qw(encode);

my $parser = XML::LibXML->new;
my $dom = $parser->parse_file("test.xml") or die;

foreach my $file ($dom->findnodes('/File')) {
  print $file->getAttribute("id")."\n";
  foreach my $identifier ($file->findnodes('./Identifier')) {
      print $identifier->getAttribute("id")."\n";
      print encode('UTF-8',$identifier->findnodes('./EngTranslation')->get_node(1)->textContent."\n");
      print encode('UTF-8',$identifier->findnodes('./Description')->get_node(1)->textContent."\n");
      print encode('UTF-8',$identifier->findnodes('./LangTranslation')->get_node(1)->textContent."\n");
  }
}

The output I get is:

MyTestApp/app/src/main/res/values/strings.xml
page_title
My First App
Home page title
My First App
count
You have %1$d view(s)
Number of page views
You have %1$d view(s)

What I'm hoping to get is:

MyTestApp/app/src/main/res/values/strings.xml
page_title
My First App
Home page title
My First App
count
You have <b>%1$d</b> view(s)
Number of page views
You have <b>%1$d</b> view(s)

This is just a simplified example of a more complicated situation; hopefully it makes sense.

Thanks!


Solution

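  • Here's a solution that monkey-patches XML::LibXML::Node. It's a bit of a hack, but it works:

    # Adds an innerXML method to every XML::LibXML node.
    sub XML::LibXML::Node::innerXML {
      my $self = shift;
      # Nodes stringify via toString, so joining the children serializes them.
      return join '', $self->childNodes();
    }
    
    …
    # say requires: use feature 'say';
    say $identifier->findnodes('./Description')->get_node(1)->innerXML;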
    

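    If encoding becomes a problem, call the toString method explicitly: its first argument controls formatting, and a true second argument makes it return a byte string in the document's encoding instead of a character string. (I did use the open pragma, but there were no out-of-range characters in the XML.)

    If you don't like the monkey patching, you can change the sub to a normal one and pass the node as an argument, like this:

    sub myInnerXML {
      my $self = shift;
      # toString(0, 1): no extra formatting; return bytes in the document encoding.
      return join '', map { $_->toString(0, 1) } $self->childNodes();
    }
    
    …
    say myInnerXML($identifier->findnodes('./Description')->get_node(1));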
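
For reference, here is a minimal, self-contained sketch of the non-patching approach applied to the 'count' entry from the question. The helper name inner_xml and the trimmed-down inline XML are assumptions made for illustration; the script only relies on childNodes, toString, textContent and load_xml from XML::LibXML.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (not part of XML::LibXML): serialize every child of a
# node, text and elements alike, and concatenate the results.
sub inner_xml {
    my ($node) = @_;
    return join '', map { $_->toString } $node->childNodes;
}

# Trimmed-down version of the XML from the question, inlined so the script
# runs without a test.xml file.
my $xml = <<'XML';
<Identifier id="count">
    <EngTranslation>You have <b>%1$d</b> view(s)</EngTranslation>
</Identifier>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my ($eng) = $dom->findnodes('/Identifier/EngTranslation');

print $eng->textContent, "\n";   # You have %1$d view(s)
print inner_xml($eng), "\n";     # You have <b>%1$d</b> view(s)
```

Here toString is called without arguments, so the result is a character string; encode it (or set an output layer) before printing if the content may contain non-ASCII characters.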