How do I validate an XML document using XML::LibXML when the DTD is available over HTTPS?
#!/usr/bin/perl -w
use XML::LibXML;
use strict;
my $xml = XML::LibXML->load_xml(IO => \*DATA);
my $dtd = XML::LibXML::Dtd->new( "-//NLM//DTD LinkOut 1.0//EN", "https://www.ncbi.nlm.nih.gov/projects/linkout/doc/LinkOut.dtd" );
my $https_is_valid = $xml->is_valid( $dtd );
print "HTTPS dtd: ", ref $dtd, "\n Is valid: $https_is_valid\n";
my $dtd_http = XML::LibXML::Dtd->new( "-//NLM//DTD LinkOut 1.0//EN", "http://www.ncbi.nlm.nih.gov/projects/linkout/doc/LinkOut.dtd" );
my $http_is_valid = $xml->is_valid( $dtd_http );
print "HTTP dtd: ", ref $dtd_http, "\n Is valid: $http_is_valid\n";
__DATA__
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE LinkSet PUBLIC "-//NLM//DTD LinkOut 1.0//EN" "https://www.ncbi.nlm.nih.gov/projects/linkout/doc/LinkOut.dtd" [
<!ENTITY base.url "https://some.domain.com">
<!ENTITY icon.url "https://some.domain.com/logo.png">
]>
<LinkSet>
<Link>
<LinkId>1</LinkId>
<ProviderId>XXXX</ProviderId>
<IconUrl>&icon.url;</IconUrl>
<ObjectSelector>
<Database>PubMed</Database>
<ObjectList>
<ObjId>1234567890</ObjId>
</ObjectList>
</ObjectSelector>
<ObjectUrl>
<Base>&base.url;</Base>
<Rule>/1/</Rule>
</ObjectUrl>
</Link>
</LinkSet>
The code above produces the following output:
HTTPS dtd:
Is valid: 0
HTTP dtd: XML::LibXML::Dtd
Is valid: 1
The DTD fails to load from the HTTPS URL, and therefore cannot be used to validate the XML.
I've downloaded the DTD over HTTPS and checked for HTTP redirects - there aren't any.
I've also had a look at XML::LibXML::InputCallback but can't see how I can incorporate it with XML::LibXML::Dtd->new( ... ).
How should I implement this validation?
The DTD is available over HTTP so I could just use that to validate, but this feels like I'm avoiding the problem rather than solving it properly!
Note that the XML already contains the URL to the DTD, so you don't need to create an XML::LibXML::Dtd to pass to ->is_valid.
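As a minimal sketch of calling ->is_valid without a Dtd argument: the toy `note` document below is hypothetical and carries its DTD in the internal subset, so it runs without touching the network. With an external DTD the call is the same, as long as the parser was able to load that DTD.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical toy document; the DTD lives in the internal subset.
my $doc = XML::LibXML->load_xml(string => <<'XML');
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to, body)>
  <!ELEMENT to   (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note><to>World</to><body>Hello</body></note>
XML

# No Dtd object is passed: the document's own DOCTYPE is used.
my $ok = $doc->is_valid;
print "Is valid: ", ($ok ? "yes" : "no"), "\n";

# validate() is the variant that dies with the validation message.
eval { $doc->validate; 1 } or print "Validation error: $@";
```

If you want the reason a document failed rather than a plain true/false, ->validate inside an eval is usually the more useful of the two.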
I agree with commenter Grant McLean that you might not want to go out on the network all the time. In fact, a while back I wrote some code that used an XML::LibXML::InputCallback to redirect all network requests to the local FS, where I had cached network resources.
But to answer your question, it wasn't too difficult to adapt that code to fetch from the network, including HTTPS, via HTTP::Tiny, which needs IO::Socket::SSL >=1.56 and Net::SSLeay >=1.49 installed for SSL support. The following prints the expected "Is valid: yes":
use warnings;
use strict;
use XML::LibXML;
use HTTP::Tiny;
use URI;
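# Ask for the external DTD explicitly, so it is fetched at parse time through
# the callbacks; the default for this option has varied between releases.
my $parser = XML::LibXML->new( load_ext_dtd => 1 );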
my $cb = XML::LibXML::InputCallback->new;
my $http = HTTP::Tiny->new;
my %cache;
$cb->register_callbacks([
    sub { 1 }, # match (URI), returns Bool
    sub { # open (URI), returns Handle
        my $uri = URI->new($_[0]);
        my $file;
        #warn "Handling <<$uri>>\n"; #Debug
        if (!$uri->scheme) { $file = $_[0] }
        elsif ($uri->scheme eq 'file') { $file = $uri->path }
        elsif ($uri->scheme=~/\Ahttps?\z/i) {
            if (!defined $cache{$uri}) {
                my $resp = $http->get($uri);
                die "$uri: $resp->{status} $resp->{reason}\n"
                    unless $resp->{success};
                $cache{$uri} = $resp->{content};
            }
            $file = \$cache{$uri};
        }
        else { die "unsupported URL scheme: ".$uri->scheme }
        open my $fh, '<', $file or die "$file: $!";
        return $fh;
    },
    sub { # read (Handle,Length), returns Data
        my ($fh,$len) = @_;
        read($fh, my $buf, $len);
        return $buf;
    },
    sub { close shift } # close (Handle)
]);
$parser->input_callbacks($cb);
my $doc = $parser->load_xml( IO => \*DATA );
print "Is valid: ", $doc->is_valid ? "yes" : "no", "\n";
__DATA__
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE LinkSet PUBLIC "-//NLM//DTD LinkOut 1.0//EN" "https://www.ncbi.nlm.nih.gov/projects/linkout/doc/LinkOut.dtd" [
<!ENTITY base.url "https://some.domain.com">
<!ENTITY icon.url "https://some.domain.com/logo.png">
]>
<LinkSet>
<Link>
<LinkId>1</LinkId>
<ProviderId>XXXX</ProviderId>
<IconUrl>&icon.url;</IconUrl>
<ObjectSelector>
<Database>PubMed</Database>
<ObjectList>
<ObjId>1234567890</ObjId>
</ObjectList>
</ObjectSelector>
<ObjectUrl>
<Base>&base.url;</Base>
<Rule>/1/</Rule>
</ObjectUrl>
</Link>
</LinkSet>
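For the local-cache variant mentioned at the start of this answer, here is a rough sketch of the same callback idea with the network taken out. The URL `https://example.com/note.dtd`, the toy `note` DTD and the `%local` hash are all made up for illustration. A real setup would more likely map URLs to files on disk. I'm also setting `load_ext_dtd` explicitly, on the assumption that the DTD has to be fetched at parse time for the callbacks to see the request.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical URL => cached DTD text (stand-in for files on the local FS).
my %local = (
    'https://example.com/note.dtd' => <<'DTD',
<!ELEMENT note (to, body)>
<!ELEMENT to   (#PCDATA)>
<!ELEMENT body (#PCDATA)>
DTD
);

my $cb = XML::LibXML::InputCallback->new;
$cb->register_callbacks([
    sub { exists $local{ $_[0] } },   # match: only URLs we have a copy of
    sub {                             # open: in-memory handle on the cached copy
        my $url = shift;
        open my $fh, '<', \$local{$url} or die "cannot open cached $url: $!";
        return $fh;
    },
    sub {                             # read: return up to $len bytes
        my ($fh, $len) = @_;
        read($fh, my $buf, $len);
        return $buf;
    },
    sub { close shift },              # close
]);

# Assumption: request the external DTD explicitly so it is loaded while parsing.
my $parser = XML::LibXML->new( load_ext_dtd => 1 );
$parser->input_callbacks($cb);

my $doc = $parser->load_xml(string => <<'XML');
<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "https://example.com/note.dtd">
<note><to>World</to><body>Hello</body></note>
XML

print "Is valid: ", ($doc->is_valid ? "yes" : "no"), "\n";
```

Because the match callback only claims URLs present in `%local`, anything else falls through to libxml2's normal handling, so you can combine this with the HTTP::Tiny fallback above if you want a cache-first setup.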