What is a 'dataset' in the context of a SPARQL query?


The SPARQL specification mentions that the FROM clause can be used to specify a dataset.

A SPARQL query may specify the dataset to be used for matching by using the FROM clause and the FROM NAMED clause to describe the RDF dataset.

What is a "dataset" in the context of SPARQL? I'm very familiar with databases in general, and I understand in principle that a query for data phrased in a language such as SQL is then executed against a dataset to resolve some subset of that dataset.

I'm trying to understand the following query:

prefix cpmeta: <...some_domain>

select distinct
?uri
?label
?stationId

from <...some_domain>
from <...some_domain>
from <...some_domain>
from <...some_domain>
from named <...some_domain>

where {

    { ?uri rdfs:label ?label }

    UNION

    { ?uri cpmeta:hasName ?label }

    UNION 

    {
        graph <...some_domain> {
            ?uri a cpmeta:Station .
            ?uri cpmeta:hasName ?label .
        }
    }

    ?uri cpmeta:hasStationId ?stationId
}

limit 100

So from the specification documentation I understand in principle that

  1. There are 4 datasets specified, and (I think)
  2. One 'RDF dataset' is defined

However, the query actually executes (but with slightly different results) if I leave out the FROM and FROM NAMED clauses:

prefix cpmeta: <...some_domain>

select distinct
?uri
?label
?stationId

where {

    { ?uri rdfs:label ?label }

    UNION

    { ?uri cpmeta:hasName ?label }

    UNION 

    {
        graph <...some_domain> {
            ?uri a cpmeta:Station .
            ?uri cpmeta:hasName ?label .
        }
    }

    ?uri cpmeta:hasStationId ?stationId
}

limit 100

So clearly??? there is already a dataset specified. Is that via the prefix?

Questions:

  1. Why is an RDF dataset identified differently to a regular dataset (FROM vs FROM NAMED)?
  2. The URI for the prefix is actually reused in a FROM statement. What is the difference between a prefix and a FROM clause?

This question - Specifying dataset within a SPARQL query - shows how to specify a dataset, but doesn't explain what that means in the context of a SPARQL query and in the context of however that SPARQL query is resolved to actual data.

This question - FROM clause in SPARQL queries - mentions that a SPARQL query without a FROM clause is executed against a default dataset. But then why would omitting all datasets still result in data returned by the query?


Solution

  • Comparing the execution of a SPARQL query with that of an SQL query is a bit tricky. SPARQL is more high-level.

    Datasets

    An endpoint (e.g. a database such as Virtuoso or GraphDB) has some freedom in which SPARQL concepts it implements, and how.

    The dataset is one such concept. A graph database usually lets you create a repository, which is roughly equivalent to a database in the SQL world. Triples are stored inside it, and these triples can be grouped into named graphs. The GRAPH construct helps you to select which named graph to look in.

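    The repository is the dataset you are referring to. In SPARQL terms, an RDF dataset consists of one default graph plus zero or more named graphs: the graphs listed with FROM are merged to form the default graph, and each FROM NAMED clause adds a named graph that is reachable only through GRAPH. When a query has no FROM or FROM NAMED clause, the endpoint falls back to its own default dataset, which is why your second query still returns data.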

    Very few databases support querying datasets or repositories that are not hosted in that same database, for fairly obvious reasons.
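
    As a rough sketch of how FROM and FROM NAMED shape the dataset, consider the query below. The graph IRIs under http://example.org/ are hypothetical and stand in for whatever graphs your endpoint actually hosts; an endpoint may also restrict which graphs a query is allowed to name.

```sparql
# Hypothetical graph IRIs, for illustration only.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?uri ?label ?g
FROM <http://example.org/graphs/a>         # becomes part of the default graph
FROM <http://example.org/graphs/b>         # merged with graph a
FROM NAMED <http://example.org/graphs/c>   # named graph, reachable only via GRAPH
WHERE {
  # Matched against the default graph (the merge of a and b); ?g stays unbound.
  { ?uri rdfs:label ?label }
  UNION
  # Matched against the named graphs (here only c); ?g is bound to the graph IRI.
  { GRAPH ?g { ?uri rdfs:label ?label } }
}
LIMIT 100
```

    Removing all three dataset clauses leaves the WHERE part valid, but the endpoint then decides what the default graph and the named graphs are.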

    SPARQL

    The less precise your query, the more data it matches. Wrapping part of a pattern in GRAPH <...> { } restricts those triple patterns to a single named graph, without the need to write a full subquery.
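
    To find out which named graphs GRAPH can select from, a query along these lines may help. It uses only standard SPARQL 1.1, but which graphs it reports depends on how the endpoint defines its default dataset, and it can be slow on a large store.

```sparql
# Lists every named graph in the dataset the query runs against,
# together with the number of triples stored in it.
SELECT ?g (COUNT(*) AS ?triples)
WHERE {
  GRAPH ?g { ?s ?p ?o }
}
GROUP BY ?g
ORDER BY DESC(?triples)
```

    Any graph IRI returned here can then be used in a GRAPH <...> { } block, or in a FROM or FROM NAMED clause.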

    Don't confuse datasets with namespaces. Identifiers in the world of RDF are always URIs. The first part of a URI usually names the organisation that invented the identifier, but the URI as a whole is still just an identifier. A prefix only makes that identifier shorter to write; it does not select any data.
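
    A small sketch of what a prefix does, assuming a hypothetical namespace http://example.org/ontology/ and a hypothetical property hasName in it:

```sparql
# ex: is only shorthand for the hypothetical namespace IRI below.
PREFIX ex: <http://example.org/ontology/>

SELECT ?uri ?label
WHERE {
  # ex:hasName is expanded to <http://example.org/ontology/hasName>
  # before matching, so writing the full IRI here gives the same results.
  ?uri ex:hasName ?label
}
LIMIT 10
```

    The PREFIX line never fetches or selects a graph. If the same IRI also appears in a FROM clause, that is because a graph happens to have been given that name, not because the two clauses are related.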

    You could put each triple in a separate graph, which turns the name of the graph into an identifier of the triple. This is not the intended usage, but it is not forbidden either.