sparql, semantic-web, sesame, triplestore, named-graphs

Is the speed of retrieving a query result affected by using named graphs?


I am using a Sesame server to store sets of triples.

First question

If the repository grows very large over time and I want to run queries over it, will query performance be affected?

Second question (if the answer to the first question is yes)

If I use named graphs for different sets of triples and run queries on them, will I get results much faster than if I ran the same queries on the entire repository?

In other words, is this query slower:

PREFIX csm: <http://exmple.org/some_ontology.owl#>

SELECT ?b ?c
WHERE {
    ?a a csm:SomeClass.
    ?a ?b ?c.
}

than this:

PREFIX csm: <http://exmple.org/some_ontology.owl#>

SELECT ?b ?c
WHERE {
    GRAPH <http://example.org/some_graph> {
      ?a a csm:SomeClass.
      ?a ?b ?c.
    }
}

when the stored data set is very large?


Solution

  • First question: If the repository grows very large over time and I want to run queries over it, will query performance be affected?

    Yes. How much size influences query performance depends on several factors: most importantly the database implementation you use and how you have configured it, but also the shape of your data (e.g. the number of type statements) and the kinds of queries you run. Sesame is a quadstore framework that comes with a few built-in database types (in-memory and native), and numerous third-party Sesame-compatible RDF databases exist, each with its own performance characteristics.

  • Second question (if the answer to the first question is yes): If I use named graphs for different sets of triples and run queries on them, will I get results much faster than if I ran the same queries on the entire repository?

    Again, it depends on the database you use, how it is configured, and the kinds of queries you run.

    Let's assume you are using a Sesame native store and have enabled at least one index in which the named graph (or "context", as it is called in Sesame) is the primary key (e.g. cspo), in addition to the usual default indices (i.e. spoc and posc). In this scenario, using named graphs can make a marked difference in performance if the named graph acts as a filter, that is, if it pre-selects a specific subset of the total potential result: the query planner can use the cspo index to quickly zoom in on a much smaller subset of the total repository.

    Note, however, that in your specific example queries it will not matter much. Your example assumes that all resources of type csm:SomeClass occur in exactly one named graph (if that were not the case, the two queries would not return the same result), so selecting that named graph does not reduce the potential answer set any further than selecting all resources of type csm:SomeClass already does.

    To explain in more detail: the query engine does lookups in the indices for each of the graph patterns in your query. The first pattern (?a a csm:SomeClass) is the cheapest to look up, since it has only one free variable. The engine will use the posc index for it, since it knows the first two keys of that index. The second pattern is primed by the result of the first (so ?a is instantiated by the outcome of the first lookup). In the query with the named graph, the engine will select the cspo index, because both c and s are known. In the query without the named graph, it will select the spoc index, since s is known (but c is not). However, because all statements with that particular s occur in the same named graph, both lookups range over almost exactly the same number of values: all possible value combinations of p and o. The spoc index also ranges over c, but there is only ever a single value for it, so that adds very little work. Both indices will therefore return their results in very comparable time, and knowing c in advance does not give a performance boost. (As an aside, I am somewhat oversimplifying the workings of the query engine here to illustrate the point.)
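    The index reasoning above can be made concrete with a small toy model. This is only a sketch under loud assumptions: each index is modelled as a sorted list of tuples, a lookup is a prefix range scan, and the data (ten hypothetical resources ex:r0..ex:r9 with three made-up properties each) is invented for illustration. It is not how Sesame's native store is implemented; it only counts how many index entries each variant of the query would touch.

```python
from bisect import bisect_left

# Toy model (NOT Sesame's implementation): an index is a sorted list of
# 4-tuples whose fields are ordered by the index name, e.g. "cspo".
POSITION = {"s": 0, "p": 1, "o": 2, "c": 3}
TYPE, CLS = "rdf:type", "csm:SomeClass"


def build_index(quads, order):
    """Reorder each (s, p, o, c) quad according to `order` and sort."""
    return sorted(tuple(q[POSITION[k]] for k in order) for q in quads)


def prefix_scan(index, prefix):
    """Range scan: return all entries whose leading fields equal `prefix`."""
    hits = []
    for entry in index[bisect_left(index, prefix):]:
        if entry[:len(prefix)] != prefix:
            break
        hits.append(entry)
    return hits


def make_quads(graph_of):
    """Hypothetical data: 10 resources, each with a type and 3 properties."""
    quads = []
    for i in range(10):
        s, g = f"ex:r{i}", graph_of(i)
        quads.append((s, TYPE, CLS, g))
        for j in range(3):
            quads.append((s, f"ex:p{j}", f"v{i}{j}", g))
    return quads


def scanned_entries(quads, graph=None):
    """Count index entries touched by the two-pattern query."""
    posc = build_index(quads, "posc")
    second = build_index(quads, "cspo" if graph else "spoc")
    # Pattern 1 (?a a csm:SomeClass): p and o are known, so scan posc.
    type_hits = prefix_scan(posc, (TYPE, CLS))
    total = len(type_hits)
    for (_, _, s, c) in type_hits:
        if graph and c != graph:
            continue  # statement lives in another named graph
        # Pattern 2 (?a ?b ?c): s is known, plus c if a graph was given.
        prefix = (graph, s) if graph else (s,)
        total += len(prefix_scan(second, prefix))
    return total


# Case from the question: every instance lives in one named graph.
single = make_quads(lambda i: "g1")
print(scanned_entries(single), scanned_entries(single, "g1"))  # 50 50

# Contrast: instances spread over five graphs, query restricted to one.
spread = make_quads(lambda i: f"g{i % 5}")
print(scanned_entries(spread), scanned_entries(spread, "g0"))  # 50 18
```

    In the single-graph case both variants touch the same number of entries, which matches the argument above. In the spread case the restricted query touches fewer entries, but it also returns a different (smaller) result, so there the named graph really acts as a filter.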

    Named graphs are a great tool for organizing data, and if you have them, using them in your queries is a good idea because it can help performance (and will certainly not hurt). But I would not organize my data into named graphs purely for the sake of query performance.