
Handling duplication of Triples


The situation

Suppose we have 2 triple files like this:

  • data1.triple (from "data source A"):

        prefix:personX vocab:name "X".
        prefix:personX vocab:birthdate "2000-01-01".

  • data2.triple (from "data source B"):

        prefix:personX vocab:name "X".
        prefix:personX vocab:birthdate "2000-01-01".

Because data1 and data2 are exactly the same, the name triple and the birthdate triple will each be imported only once.

But what if data1 and data2 have different values for personX's date of birth, like this:

  • data1.triple (from "data source A"):

        prefix:personX vocab:name "X".
        prefix:personX vocab:birthdate "2000-01-01".

  • data2.triple (from "data source B"):

        prefix:personX vocab:name "X".
        prefix:personX vocab:birthdate "1999-01-01".

In this case, I want to load only one of "2000-01-01" or "1999-01-01", because having two dates of birth does not make sense.

Question

Is there any mechanism, directive, or concept for expressing:

  • "some predicate should have one edge per one Entity"
  • "data source A" has a higher precedence than "data source B"

So that 'personX' has the 'name' predicate exactly once.


Solution

  • There's nothing that will let you constrain what can appear in the data. RDF is a set of triples, and that's all you get. However, that doesn't mean that you're without hope. Let's address your second question first:

    • "data source A" has a higher precedence than "data source B"

    If you use an RDF dataset with named graphs, which is very common to do with SPARQL, you can put the data from each of your sources into a named graph, and then you could select from one with higher priority than the other. E.g., something like:

    select ?birthdate {
      values (?priority ?graph) { (1 :A) (2 :B) }
      graph ?graph { :person :birthdate ?birthdate }
    }
    order by ?priority
    limit 1
    

    Then you'd get any birthdate properties from graph A before getting any from graph B.

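    A less extensible approach, but still suitable if you only have the two graphs and you know that there's at most one value in each, is to use coalesce. Each graph pattern has to be optional; otherwise the query returns nothing whenever either graph lacks a value, and coalesce never gets a chance to fall back:

    select (coalesce(?birthdateA, ?birthdateB) as ?birthdate) {
      optional { graph :A { :person :birthdate ?birthdateA } }
      optional { graph :B { :person :birthdate ?birthdateB } }
    }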
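
If you'd rather resolve the conflict before loading, the same precedence idea can be sketched in plain Python. This is only an illustration under stated assumptions: each source is modelled as a set of (subject, predicate, object) tuples keyed by a graph name, no RDF library is involved, and the helper name pick_by_priority is hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), all plain strings


def pick_by_priority(
    graphs: Dict[str, Set[Triple]],
    priority: List[str],
    subject: str,
    predicate: str,
) -> Optional[str]:
    """Return the value from the first graph in `priority` that has one."""
    for name in priority:
        values = sorted(
            o for (s, p, o) in graphs.get(name, set())
            if s == subject and p == predicate
        )
        if values:
            # If a single graph holds several values, take the smallest so the
            # result is deterministic (an assumption made for this sketch).
            return values[0]
    return None  # no graph has a value for this subject and predicate


# Toy stand-ins for data1.triple (source A) and data2.triple (source B).
graphs = {
    "A": {("personX", "name", "X"), ("personX", "birthdate", "2000-01-01")},
    "B": {("personX", "name", "X"), ("personX", "birthdate", "1999-01-01")},
}

print(pick_by_priority(graphs, ["A", "B"], "personX", "birthdate"))  # 2000-01-01
```

Reversing the priority list to ["B", "A"] would pick "1999-01-01" instead, and a source with no value for the predicate is simply skipped.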
    
    • "some predicate should have one edge per one Entity"

    It's easy to check for violations using SPARQL. You'd just do something like this to identify the problematic data:

    select ?badPerson {
      ?badPerson :birthdate ?birthdate
    }
    group by ?badPerson
    having (count(distinct ?birthdate) != 1)
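
The same check can be run on the raw triples before they are loaded. The following is a minimal sketch in plain Python, assuming the triples from all sources have already been parsed into (subject, predicate, object) tuples; the function name subjects_with_conflicts is hypothetical and no RDF library is used.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), all plain strings


def subjects_with_conflicts(triples: Iterable[Triple], predicate: str) -> List[str]:
    """List subjects whose number of distinct values for `predicate` is not 1."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            values[s].add(o)  # a set keeps only distinct objects
    # Like the query above, subjects with no value at all never show up here,
    # so "!= 1" effectively means "more than one".
    return sorted(s for s, objs in values.items() if len(objs) != 1)


# Toy data: the union of the two sources from the question.
merged = [
    ("personX", "name", "X"),
    ("personX", "birthdate", "2000-01-01"),
    ("personX", "name", "X"),
    ("personX", "birthdate", "1999-01-01"),
]

print(subjects_with_conflicts(merged, "birthdate"))  # ['personX']
print(subjects_with_conflicts(merged, "name"))       # []
```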
    

    To specify that there should only be one value, you'd need to start using an ontology language such as OWL, wherein you could state, for instance, that:

            Person SubClassOf (hasBirthdate exactly 1)

    Now, that won't keep someone from asserting inconsistent data, but an OWL reasoner with support for datatype reasoning will be able to recognize an inconsistency if one appears.