Tags: xml, perl, xml-namespaces, libxml2, xml-libxml

Remove XML namespaces with XML::LibXML


I'm converting an XML document into HTML. One step is removing namespaces, which cannot legally be declared in HTML (other than the XHTML namespace on the root element). I have found posts from 5-10 years ago about how difficult this is to do with XML::LibXML and libxml2, but little that is recent. Here's an example:

use XML::LibXML;
use XML::LibXML::XPathContext;
use feature 'say';

my $xml = <<'__EOI__';
<myDoc>
  <par xmlns:bar="www.bar.com">
    <bar:foo/>
  </par>
</myDoc>
__EOI__

my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($xml);

my $bar_foo = do{
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs('bar', 'www.bar.com');
    ${ $xpc->findnodes('//bar:foo') }[0];
};
$bar_foo->setNodeName('foo');
$bar_foo->setNamespace('','');
say $bar_foo->nodeName; #prints 'bar:foo'. Dang!

my @namespaces = $doc->findnodes('//namespace::*');
for my $ns (@namespaces){
    # $ns->delete; #can't find any such method for namespaces
}
say $doc->toStringHTML;

In this code I tried a few things that didn't work. First, I tried renaming the bar:foo element to an unprefixed foo with setNodeName (the documentation says that method is namespace-aware, but apparently it isn't). Then I tried setting the element's namespace to null with setNamespace, and that didn't work either. Finally, I looked through the docs for a method that deletes namespaces and found none. The final output string still has everything I want to remove (namespace declarations and prefixes).
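To make the leftover prefix easier to see, here is a minimal sketch that prints the parts of a node's name as XML::LibXML reports them. The describe_name helper is hypothetical (it is not part of XML::LibXML), and the sample uses an absolute namespace URI, http://www.bar.com/, in place of www.bar.com:

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical helper (not part of XML::LibXML): summarises the
# qualified name, local name, prefix and namespace URI of a node.
sub describe_name {
    my ($node) = @_;
    return join ', ',
        'nodeName='     . $node->nodeName,
        'localname='    . $node->localname,
        'prefix='       . ($node->prefix       // '(none)'),
        'namespaceURI=' . ($node->namespaceURI // '(none)');
}

my $doc = XML::LibXML->load_xml(string => <<'__EOI__');
<myDoc>
  <par xmlns:bar="http://www.bar.com/">
    <bar:foo/>
  </par>
</myDoc>
__EOI__

# A prefixed element has to be looked up through a registered prefix.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs('bar', 'http://www.bar.com/');
my ($bar_foo) = $xpc->findnodes('//bar:foo');

# An element in no namespace can be found with a plain XPath.
my ($par) = $doc->findnodes('//par');

print describe_name($bar_foo), "\n";
print describe_name($par), "\n";
```

The local name is already available separately from the prefix, so the goal is an element whose nodeName equals its localname and whose namespaceURI is undefined.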

Does anyone have a way to remove namespaces, setting elements and attributes to the null namespace?


Solution

  • Here's my own gymnasticsy answer. If there is no better way, it will do. I sure wish there were a better way...

    The replace_without_ns function replaces an element with a copy that has no namespace. Any child elements that still need the namespace get the declaration on themselves instead. The code below moves the entire document into the null namespace:

    use strict;
    use warnings;
    use XML::LibXML;
    
    my $xml = <<'__EOI__';
    <myDoc xmlns="foo">
      <par xmlns:bar="www.bar.com" foo="bar">
        <bar:foo stuff="junk">
          <baz bar:thing="stuff"/>
          fooey
          <boof/>
        </bar:foo>
      </par>
    </myDoc>
    __EOI__
    
    my $parser = XML::LibXML->new();
    my $doc = $parser->parse_string($xml);
    
    # remove namespaces for the whole document
    for my $el($doc->findnodes('//*')){
        if($el->getNamespaces){
            replace_without_ns($el);
        }
    }
    
    # replaces the given element with an identical one without the namespace
    # also does this with attributes
    sub replace_without_ns {
        my ($el) = @_;
        # new element has same name, minus namespace
        my $new = XML::LibXML::Element->new( $el->localname );
        # copy attributes (minus namespace declarations)
        for my $att($el->attributes){
            # anchored so that only xmlns and xmlns:prefix are skipped
            if($att->nodeName !~ /^xmlns(?::|$)/){
                $new->setAttribute($att->localname, $att->value);
            }
        }
        #move children
        for my $child($el->childNodes){
            $new->appendChild($child);
        }
    
        # if working with the root element, we have to set the new element
        # to be the new root
        my $doc = $el->ownerDocument;
        if( $el->isSameNode($doc->documentElement) ){
            $doc->setDocumentElement($new);
            return;
        }
        #otherwise just paste the new element in place of the old element
        $el->parentNode->insertAfter($new, $el);
        $el->unbindNode;
        return;
    }
    
    print $doc->toStringHTML;
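
A possible variation on the same idea, offered as a sketch rather than a tested replacement: instead of swapping elements one at a time, build a namespace-free copy of the whole tree recursively and install it as the new root. The copy_without_ns helper is hypothetical, and the sample input is the one above with absolute namespace URIs substituted. Attributes that share a local name but differ in namespace would collapse into one, so this assumes local names are unique per element.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns a namespace-free deep copy of $node,
# created in (but not yet attached to) the document $doc.
sub copy_without_ns {
    my ($node, $doc) = @_;

    # Text, comments, CDATA etc. carry no namespace; clone them unchanged.
    return $node->cloneNode(1) unless $node->isa('XML::LibXML::Element');

    # New element uses only the local name, so it has no prefix or namespace.
    my $copy = $doc->createElement($node->localname);
    for my $att ($node->attributes) {
        # attributes() also returns xmlns declarations, as
        # XML::LibXML::Namespace objects; skip those.
        next unless $att->isa('XML::LibXML::Attr');
        $copy->setAttribute($att->localname, $att->value);
    }

    # Recurse so every descendant element is rebuilt the same way.
    $copy->appendChild(copy_without_ns($_, $doc)) for $node->childNodes;
    return $copy;
}

my $doc = XML::LibXML->load_xml(string => <<'__EOI__');
<myDoc xmlns="http://example.com/foo">
  <par xmlns:bar="http://www.bar.com/" foo="bar">
    <bar:foo stuff="junk">
      <baz bar:thing="stuff"/>
      fooey
      <boof/>
    </bar:foo>
  </par>
</myDoc>
__EOI__

# Replace the root with its namespace-free copy.
$doc->setDocumentElement(copy_without_ns($doc->documentElement, $doc));

# Sanity check: collect any element or attribute still in a namespace.
my @leftover = grep { defined $_->namespaceURI } $doc->findnodes('//* | //@*');
print scalar(@leftover), " namespaced nodes remain\n";

print $doc->toStringHTML;
```

Because nothing is moved out of the original tree while it is being walked, this version does not depend on the order in which elements are visited.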