Accessing RDF/XML/OWL file nodes using Perl

I have some RDF/XML data that I'd like to parse so that I can access its nodes. It looks like this:

<!-- -->

    <owl:Class rdf:about="&obo;VO_0000185">
        <rdfs:label>Influenza virus gene</rdfs:label>
        <rdfs:subClassOf rdf:resource="&obo;VO_0000156"/>
    </owl:Class>

    <!-- -->

    <owl:Class rdf:about="&obo;VO_0000186">
        <rdfs:label>RNA vaccine</rdfs:label>
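        <owl:equivalentClass>
            <owl:Class>
                <owl:intersectionOf rdf:parseType="Collection">
                    <rdf:Description rdf:about="&obo;VO_0000001"/>
                    <owl:Restriction>
                        <owl:onProperty rdf:resource="&obo;BFO_0000161"/>
                        <owl:someValuesFrom rdf:resource="&obo;VO_0000728"/>
                    </owl:Restriction>
                </owl:intersectionOf>
            </owl:Class>
        </owl:equivalentClass>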
        <rdfs:subClassOf rdf:resource="&obo;VO_0000001"/>
        <obo:IAO_0000116>Using RNA may eliminate the problem of having to tailor a vaccine for each individual patient with their specific immunity. The advantage of RNA is that it can be used for all immunity types and can be taken from a single cell. DNA vaccines need to produce RNA which then prompts the manufacture of proteins. However, RNA vaccine eliminates the step from DNA to RNA.</obo:IAO_0000116>
        <obo:IAO_0000115>A vaccine that uses RNA(s) derived from a pathogen organism.</obo:IAO_0000115>
    </owl:Class>

The complete RDF/XML file can be found here.

What I want to do is the following:

  1. Find each chunk that contains the entry <rdfs:subClassOf rdf:resource="&obo;VO_0000001"/>
  2. Access the literal term defined by <rdfs:label>...</rdfs:label>

So in the example above, the code would pick out the second chunk and output "RNA vaccine".

I'm currently stuck with the following code, which fails to access the node. What's the right way to do it? Solutions other than XML::LibXML are welcome.

#!/usr/bin/perl -w
use strict;
use Data::Dumper;
use Carp;
use File::Basename;
use XML::LibXML 1.70;

my $filename = "VO.owl";
# Obtained from

my $parser = XML::LibXML->new();
my $doc = $parser->parse_file( $filename );

foreach my $chunk ($doc->findnodes('/owl:Class')) {
        my ($label) = $chunk->findnodes('./rdfs:label');
        my ($subclass) = $chunk->findnodes('./rdfs:subClassOf');
        print $label->to_literal;
        print $subclass->to_literal;
}



  • Parsing RDF as if it were XML is a folly. The exact same data can appear in many different ways. For example, all of the following RDF files carry the same data. Any conforming RDF implementation MUST handle them identically...

    <!-- example 1 -->
    <rdf:RDF xmlns:rdf="">
      <rdf:Description rdf:about="#me">
        <rdf:type rdf:resource="" />
        <foaf:name>Toby Inkster</foaf:name>
      </rdf:Description>
    </rdf:RDF>

    <!-- example 2 -->
      <foaf:Person rdf:about="#me">
        <foaf:name>Toby Inkster</foaf:name>
      </foaf:Person>

    <!-- example 3 -->
      <foaf:Person rdf:about="#me" foaf:name="Toby Inkster" />

    <!-- example 4 -->
      <rdf:Description rdf:about="#me"
        foaf:name="Toby Inkster" />

    <!-- example 5 -->
    <rdf:RDF xmlns:rdf="">
      <rdf:Description rdf:ID="me">
        <rdf:type>
          <rdf:Description rdf:about="" />
        </rdf:type>
        <foaf:name>Toby Inkster</foaf:name>
      </rdf:Description>
    </rdf:RDF>

    <!-- example 6 -->
        foaf:name="Toby Inkster" />

    I could easily list half a dozen other variations too, but I'll stop there. And this RDF file contains just two statements - I'm a Person; my name is "Toby Inkster" - the OP's data contains over 50,000 statements.
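
    To make the equivalence concrete, here is a minimal sketch that parses two of the spellings (the rdf:Description form and the typed-node form) with RDF::Trine and compares the resulting triples. The namespace URIs, the base URI http://example.org/doc and the helper name triples_of are my own assumptions for illustration; they are not taken from the OP's file.

    ```perl
    use strict;
    use warnings;
    use RDF::Trine;

    # Namespace declarations shared by both documents. These are the standard
    # RDF and FOAF namespace URIs, filled in here as an assumption.
    my $ns = 'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
           . 'xmlns:foaf="http://xmlns.com/foaf/0.1/"';

    # Spelling A: generic rdf:Description node with an explicit rdf:type.
    my $explicit = qq{<rdf:RDF $ns>
      <rdf:Description rdf:about="#me">
        <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person" />
        <foaf:name>Toby Inkster</foaf:name>
      </rdf:Description>
    </rdf:RDF>};

    # Spelling B: typed node element, where the element name carries the type.
    my $typed = qq{<rdf:RDF $ns>
      <foaf:Person rdf:about="#me">
        <foaf:name>Toby Inkster</foaf:name>
      </foaf:Person>
    </rdf:RDF>};

    # Hypothetical helper: parse RDF/XML into a throwaway in-memory model and
    # return its triples as sorted N-Triples lines, so graphs can be compared.
    sub triples_of {
        my ($xml) = @_;
        my $model = RDF::Trine::Model->temporary_model;
        RDF::Trine::Parser->new('rdfxml')
            ->parse_into_model('http://example.org/doc', $xml, $model);
        my $nt = RDF::Trine::Serializer->new('ntriples')
            ->serialize_model_to_string($model);
        my @lines = grep { length } split /\n/, $nt;
        return sort @lines;
    }

    my @a = triples_of($explicit);
    my @b = triples_of($typed);
    print "@a" eq "@b" ? "same triples\n" : "different triples\n";
    print "$_\n" for @a;
    ```

    An XPath expression written against spelling A would find nothing in spelling B, whereas the triple-level comparison does not care which spelling was used.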

    And this is just the XML serialization of RDF; there are other serializations too.

    If you try handling all that with XPath, you're likely to end up becoming a lunatic locked away in a tower somewhere, muttering in his sleep about the triples; the triples...

    Luckily, Greg Williams has taken that mental health bullet for you. RDF::Trine and RDF::Query are not only the best RDF frameworks for Perl; they're amongst the best in any programming language.

    Here is how the OP's task could be achieved using RDF::Trine and RDF::Query:

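    #!/usr/bin/env perl
    use v5.12;
    use RDF::Trine;
    use RDF::Query;

    # NOTE: the store and parser arguments did not survive in this listing.
    # The SQLite-backed store and the local VO.owl path are assumptions.
    my $model = 'RDF::Trine::Model'->new(
        'RDF::Trine::Store::DBI'->new(
            'vo',                           # model name (assumed)
            'dbi:SQLite:dbname=vo.sqlite',  # DSN (assumed)
            '',  # no username
            '',  # no password
        ),
    );
    # Parse the ontology on the first run only; later runs reuse the store.
    'RDF::Trine::Parser'->new('rdfxml')->parse_file_into_model(
        'http://example.org/',  # base URI (assumed)
        'VO.owl',               # local copy of the ontology (assumed)
        $model,
    ) unless $model->size > 0;

    my $query = RDF::Query->new(<<'SPARQL');
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?super_label ?sub_label
    WHERE {
        ?sub rdfs:subClassOf ?super .
        ?sub rdfs:label ?sub_label .
        ?super rdfs:label ?super_label .
    }
    LIMIT 5
    SPARQL
    print $query->execute($model)->as_string;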

    Sample output:

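    +----------------------------+----------------------------------+
    | super_label                | sub_label                        |
    +----------------------------+----------------------------------+
    | "Aves vaccine"             | "Ducks vaccine"                  |
    | "route of administration"  | "intravaginal route"             |
    | "Shigella gene"            | "aroA from Shigella"             |
    | "Papillomavirus vaccine"   | "Bovine papillomavirus vaccine"  |
    | "virus protein"            | "Feline leukemia virus protein"  |
    +----------------------------+----------------------------------+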

    UPDATE: Here's a SPARQL query that can be plugged into the script above to retrieve the data you wanted:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX obo:  <http://purl.obolibrary.org/obo/>
    SELECT ?subclass ?label
    WHERE {
        ?subclass
            rdfs:subClassOf obo:VO_0000001 ;
            rdfs:label ?label .
    }
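
    For anyone who wants to try this without downloading the whole ontology, here is a minimal self-contained sketch of that query running against an in-memory model. The inline RDF/XML is a cut-down stand-in for VO.owl containing only the two classes quoted in the question; expanding &obo; to http://purl.obolibrary.org/obo/ and the base URI http://example.org/ are assumptions on my part.

    ```perl
    use strict;
    use warnings;
    use RDF::Trine;
    use RDF::Query;

    # Cut-down stand-in for VO.owl: the two classes quoted in the question,
    # with &obo; expanded by hand to the OBO PURL base (an assumption).
    my $rdfxml = q{<rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
        xmlns:owl="http://www.w3.org/2002/07/owl#">
      <owl:Class rdf:about="http://purl.obolibrary.org/obo/VO_0000185">
        <rdfs:label>Influenza virus gene</rdfs:label>
        <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/VO_0000156"/>
      </owl:Class>
      <owl:Class rdf:about="http://purl.obolibrary.org/obo/VO_0000186">
        <rdfs:label>RNA vaccine</rdfs:label>
        <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/VO_0000001"/>
      </owl:Class>
    </rdf:RDF>};

    # Load the sample into a temporary in-memory model (no database needed).
    my $model = RDF::Trine::Model->temporary_model;
    RDF::Trine::Parser->new('rdfxml')
        ->parse_into_model('http://example.org/', $rdfxml, $model);

    # Direct subclasses of VO_0000001, together with their labels.
    my $query = RDF::Query->new(q{
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX obo:  <http://purl.obolibrary.org/obo/>
        SELECT ?subclass ?label
        WHERE {
            ?subclass rdfs:subClassOf obo:VO_0000001 ;
                      rdfs:label ?label .
        }
    }) or die RDF::Query->error;

    # Each result row is a hash of variable name => RDF term.
    my @labels;
    my $results = $query->execute($model);
    while (my $row = $results->next) {
        push @labels, $row->{label}->literal_value;
    }
    print "$_\n" for @labels;
    ```

    Against this sample, only the second class matches, so the loop should yield the single label the question asks for: RNA vaccine. Iterating over the rows, rather than calling as_string, gives you the bare label without the table formatting.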