semantic-web linked-data triplestore

How do triple stores use linked data?


Let's say I have the following scenario:

I have several ontology files hosted on the web on different domains, such as _http://foo1.com/ontolgy1.owl#, _http://foo2.com/ontolgy2.owl#, etc.

I also have a triple store into which I want to insert instances of classes from those ontologies, like this:

INSERT DATA
{
  <http://foo1.com/instance1> a <http://foo1.com/ontolgy1.owl#class1>.
  <http://foo2.com/instance2> a <http://foo2.com/ontolgy2.owl#class2>.
  <http://foo2.com/instance2x> a <http://foo2.com/ontolgy2.owl#class2x>.
}

Let's say that _http://foo2.com/ontolgy2.owl#class2x is a subclass of _http://foo2.com/ontolgy2.owl#class2 defined within the same ontology.

After the insert, if I run a SPARQL query like this:

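PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select ?a
where
{
  ?a rdf:type ?type.
  ?type rdfs:subClassOf* <http://foo2.com/ontolgy2.owl#class2> .
}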

the result would be:

<http://foo2.com/instance2>

and not:

<http://foo2.com/instance2>
<http://foo2.com/instance2x>

as it should be. This is happening because the ontology file _http://foo2.com/ontolgy2.owl# is not imported into the triple store.

My question is:

Can we talk in this example about "linked" data? Because it seems to me that it is not linked at all. It has to be imported locally into a triple store, and after that you can start querying.

Let's say you want to run a query on some complex data that is described by 20 ontology files; then all 20 ontology files would need to be imported.

Isn't this a bit disappointing?

Have I misunderstood triple stores and linked data and how they work together?


Solution

  • as it should be.

    I'm not certain that "should" is the right term here. A SPARQL query is evaluated against the data stored in a particular graph at the endpoint. IRIs are more or less opaque identifiers; just because they might also be URLs from which additional data can be retrieved doesn't obligate any particular system to actually do that kind of retrieval. Doing so would easily make query behavior unpredictable: "This query worked yesterday, why doesn't it work today? Oh, a remote website is no longer available…".
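
    If you do want the subclass axioms to be visible to your query, you can pull them in explicitly. A minimal sketch, assuming your store implements the SPARQL 1.1 Update LOAD operation and that http://foo2.com/ontolgy2.owl actually serves an RDF document containing the rdfs:subClassOf triple (both are assumptions about your setup):

    ```sparql
    # Fetch the remote ontology once and copy its triples into the default graph,
    # the same graph the INSERT DATA in the question wrote to.
    LOAD <http://foo2.com/ontolgy2.owl>
    ```

    After that, the rdfs:subClassOf* path has a locally stored triple to follow, so the query should also match instance2x. It is a one-time copy: if the remote file changes, you decide when to reload it.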

    Let's say that _http://foo2.com/ontolgy2.owl#class2x is a subclass of _http://foo2.com/ontolgy2.owl#class2 defined within the same ontology.

    Remember, since IRIs are opaque, anyone can define a term in any ontology. It's always possible for someone else to come along and say something else about a resource. You have no way of tracking all that information. For instance, if I go and write an ontology, I can declare http://foo2.com/ontolgy2.owl#class2x as a class and assert that it's equivalent to http://dbpedia.org/ontology/Person. Should the system have some way to know about what I did someplace else, and even if it did, should it be required to go and retrieve information from it? What if I made an ontology that's 2GB in size? Surely your endpoint can't be expected to go and retrieve that just to answer a quick query?
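
    To make that concrete, here is a sketch of what such a third-party statement could look like if I inserted it into my own store. Nothing about it requires access to foo2.com, and it uses only the standard OWL vocabulary:

    ```sparql
    PREFIX owl: <http://www.w3.org/2002/07/owl#>

    # Asserted in someone else's store, not at foo2.com
    INSERT DATA
    {
      <http://foo2.com/ontolgy2.owl#class2x> a owl:Class ;
          owl:equivalentClass <http://dbpedia.org/ontology/Person> .
    }
    ```

    Your endpoint has no way of discovering these triples unless someone tells it where to look.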

    Can we talk in this example about "linked" data? Because it seems to me that it is not linked at all. It has to be imported locally into a triple store, and after that you can start querying.

    Let's say I want to run a query on some complex data that is described by 20 ontology files; in this case I have to import all 20 ontology files.

    This is usually the case. The point of linked data is that you have a way to get more information if you choose to, and that you don't have to do as much work negotiating how to identify resources in that data. However, you can use the service keyword in SPARQL to reference other endpoints, and that provides a kind of linking. For instance, knowing that DBpedia has a SPARQL endpoint, I can run a local query that incorporates DBpedia with something like this:

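    # The default prefix is a placeholder for the local vocabulary
    prefix : <http://example.org/local#>
    prefix foaf: <http://xmlns.com/foaf/0.1/>

    select ?person ?localValue ?publicName {
      ?person :hasLocalValueOfInterest ?localValue .
      service <http://dbpedia.org/sparql> {
        ?person foaf:name ?publicName
      }
    }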
    

    You can use multiple service blocks to aggregate data from multiple endpoints; you're not limited to just one. That seems pretty "linked" to me.
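
    As a sketch of that, here is the same query with a second service block. The endpoint http://example.org/sparql and the default prefix are hypothetical placeholders; only the DBpedia endpoint and the FOAF namespace are real:

    ```sparql
    # Placeholder namespace for the local vocabulary (assumption)
    prefix : <http://example.org/local#>
    prefix foaf: <http://xmlns.com/foaf/0.1/>

    select ?person ?localValue ?publicName ?homepage {
      ?person :hasLocalValueOfInterest ?localValue .
      service <http://dbpedia.org/sparql> {
        ?person foaf:name ?publicName
      }
      # silent: errors from this (hypothetical) endpoint are ignored
      # instead of failing the whole query
      service silent <http://example.org/sparql> {
        ?person foaf:homepage ?homepage
      }
    }
    ```

    The silent keyword is one way to soften the "remote website is no longer available" problem: the query still runs, just without bindings from the unreachable endpoint.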