Search code examples
sparqlblazegraphblank-nodes

Grouping by blank nodes


I have the following data:

@prefix f: <http://example.org#> .

_:a f:trait "Rude"@en .
_:a f:name "John" .
_:a f:surname "Roy" .
_:b f:trait "Crude"@en .
_:b f:name "Mary" .
_:b f:surname "Lestern" .

However, if I execute the following query in Blazegraph:

PREFIX f: <http://example.org#>

SELECT ?s ?o
WHERE
{
    ?s f:trait ?o .
}

I get six results:

s   o
t32 Crude
t37 Crude
t39 Crude
t31 Rude
t36 Rude
t38 Rude

If blank nodes _:a and _:b are distinct nodes, how should I write a SPARQL query to return only two distinct results? I have tried SELECT DISTINCT, but it still returns six results. I have tried grouping by ?o, but Blazegraph returns an error, saying it's a bad aggregate. Why does this kind of output of repeating tuples happen? And how to avoid it?


Solution

  • The problem is that you have inserted data containing blank nodes several times. This operation is not idempotent.

    Useful quotes

    From RDF 1.1 Concepts and Abstract Syntax:

    Blank node identifiers are local identifiers that are used in some concrete RDF syntaxes or RDF store implementations. They are always locally scoped to the file or RDF store, and are not persistent or portable identifiers for blank nodes.

    From RDF 1.1 Semantics:

    RDF graphs can be viewed as conjunctions of simple atomic sentences in first-order logic, where blank nodes are free variables which are understood to be existential. Taking the union of two graphs is then analogous to syntactic conjunction in this syntax. RDF syntax has no explicit variable-binding quantifiers, so the truth conditions for any RDF graph treat the free variables in that graph as existentially quantified in that graph. Taking the union of graphs which share a blank node changes the implied quantifier scopes.

    From SPARQL 1.1 Query Language:

    Blank node labels are scoped to a result set.

    There need not be any relation between a label _:a in the result set and a blank node in the data graph with the same label.

    An application writer should not expect blank node labels in a query to refer to a particular blank node in the data.

    From SPARQL 1.1 Update:

    Blank nodes... are assumed to be disjoint from the blank nodes in the Graph Store, i.e., will be inserted with "fresh" blank nodes.

    Some discussion

    Different triplestores provides solutions for the "problems" described. E.g., Jena allows to use pseudo-URIs like <_:b1> etc.