
SPARQL Distinct pairs


I've got a dataset in which some documents have the same author. I need to get the distinct pairs of such documents. I tried the following:

SELECT DISTINCT ?d1 ?d2  WHERE {
?d1 myns:creator ?x.
?d2 myns:creator ?y.
FILTER (?x=?y && ?d1!=?d2).
}
GROUP BY ?d1 ?d2

But with this query, both (DOC1, DOC2) and (DOC2, DOC1) are in the result. I need to get rid of one of the two pairs. Here is the whole triple database:

@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> . 
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix myns: <http://my.local.namespace#> .

_:doc1 rdf:type myns:Document.
_:doc1 myns:creator _:Pete.
_:doc1 myns:year "2000"^^xsd:integer.
_:doc1 myns:publisher _:p1.

_:doc2 rdf:type myns:Document.
_:doc2 myns:creator _:John.
_:doc2 myns:year "2004"^^xsd:integer.
_:doc2 myns:publisher _:p2.


_:doc3 rdf:type myns:Document.
_:doc3 myns:creator _:Pete.
_:doc3 myns:publisher _:p3.

_:doc4 rdf:type myns:Document.
_:doc4 myns:creator _:Bob.
_:doc4 myns:year "2010"^^xsd:integer.
_:doc4 myns:publisher _:p2.

_:Pete rdf:type myns:Person.
_:Pete myns:knows _:Bob.
_:Pete myns:knows _:John .

_:John rdf:type myns:Person.
_:John myns:age "29"^^xsd:integer.
_:John myns:knows _:Bob.

_:Bob rdf:type myns:Person.
_:Bob myns:age "35"^^xsd:integer.

The result I get after executing the query is:

D1  D2
_:891f1e98-b411-4e54-9533-18d530f09c6ddoc1  _:891f1e98-b411-4e54-9533-18d530f09c6ddoc3
_:891f1e98-b411-4e54-9533-18d530f09c6ddoc3  _:891f1e98-b411-4e54-9533-18d530f09c6ddoc1

As you can see, both pairs are technically the same. I just need one of them. I am not sure about the details of the environment, but it uses the Sesame framework.


Solution

  • This will work in some systems:

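    PREFIX myns: <http://my.local.namespace#>

    # IRI() applied to a blank node is a non-standard extension
    SELECT ?d1 ?d2  WHERE {
      ?d1 myns:creator ?x.
      ?d2 myns:creator ?y.
      FILTER (?x=?y && STR(IRI(?d1)) < STR(IRI(?d2))).
    }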
    

    ?d1 and ?d2 are going to be blank nodes, and blank nodes carry no label that a query can rely on. So to provide an ordering for <, we need some kind of query-wide label or value associated with each one.

    Your data does not have any distinguishing triples for each person. It would be better to put real names in the data:

    _:Pete rdfs:label "Pete" .
    

    Even better, use the FOAF vocabulary.

    Some systems allow blank nodes in IRI(); technically this is an extension of the SPARQL specification. You can then take the STR form and compare. That works on your data for me (Apache Jena), but you don't say which RDF system you are using.

    The best solution is to put distinguishing information into the data.
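
    As a minimal sketch of that last point, assume the documents are given IRIs instead of blank nodes. The names myns:doc1 and myns:doc3 are hypothetical and not part of the original data:

    # Hypothetical data: documents identified by IRIs rather than blank nodes
    myns:doc1 myns:creator _:Pete .
    myns:doc3 myns:creator _:Pete .

    STR() is defined for IRIs in standard SPARQL, so the comparison should no longer depend on an engine-specific extension:

    PREFIX myns: <http://my.local.namespace#>

    SELECT ?d1 ?d2 WHERE {
      # Reusing ?x in both patterns requires the same creator
      ?d1 myns:creator ?x .
      ?d2 myns:creator ?x .
      # Strict < drops self-pairs and keeps only one ordering of each pair
      FILTER (STR(?d1) < STR(?d2))
    }

    The same idea should work with any other distinguishing value, for example a unique label per document: compare the STR() of the two labels instead of the two IRIs.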