Search code examples
sparqlrdfsemantic-webgraphdb

Speeding up SPARQL query on GraphDB


I'm trying to speed up and to optimize this query

select distinct ?root where { 
    ?root a :Root ;
          :hasnode* ?node ;
          :hasnode* ?node2 .

    ?node a :Node ;
           :hasAnnotation ?ann .
    ?ann :hasReference ?ref .
    ?ref a :ReferenceType1 .

    ?node2 a :Node ;
            :hasAnnotation ?ann2 .
    ?ann2 :hasReference ?ref2 .
    ?ref2 a :ReferenceType2 .

}

Basically, I'm analyzing some trees and I want to get all trees (i.e., trees' roots) which do have at least a couple of underlying nodes with a pattern like this one:

?node_x a :Node ;
       :hasAnnotation ?ann_x .
?ann_x :hasReference ?ref_x .
?ref_x a :ReferenceTypex .

one with x = 1 and the other with x = 2.

Since in my graph one node may have at most one :hasAnnotation predicate, I do not have to specify that those nodes must be different.

The problem

The aforementioned query describes what I need but does have a very bad performance. After minutes and minutes of execution, it is still running.

My (ugly) solution: breaking it in half

I noticed that if a look for a node pattern at a time, I get my result in some seconds(!).

Sadly enough, my current approach consists in running the following query type twice:

select distinct ?root where { 
    ?root a :Root ;
          :hasnode* ?node .

    ?node a :Node ;
           :hasAnnotation ?ann_x .
    ?ann_x :hasReference ?ref_x .
    ?ref_x a :ReferenceTypex .
}

one with x = 1 and the other with x = 2.

Saving partial results (i.e., ?roots) in 2 sets, let's say R1 and R2 and finally calculating intersection between those resultsets.

Is there a way to speed up my initial approach to get results just by leveraging SPARQL?

PS: I'm working with GraphDB.


Solution

  • Without knowing the specific dataset I can give you only some general directions how to optimize the query:

    Avoid using DISTINCT for large datasets

    The GraphDB query optimiser will not rewrite automatically the query to use EXISTS for all patterns not participating in the projection. The query semantics is to find that there is at least one such pattern, but not give me all bindings and then eliminate the duplicated results.

    Materialize the property paths

    GraphDB has a very efficient forward chaining reasoner and relatively not so optimised property path expansion. If you are not concerned for the write/data update performance, I suggest you to declare :hasNode as a transitive property (see owl:TransitiveProperty in query), which will eliminate the property path wildcard. This will boost many times the query speed.

    Your final query should look like:

    select ?root where { 
        ?root a :Root ;
              :hasnode ?node ;
              :hasnode ?node2 .
    
        FILTER (?node != ?node2)
    
        FILTER EXISTS {
            ?node a :Node ;
                   :hasAnnotation ?ann .
            ?ann :hasReference ?ref .
            ?ref a :ReferenceType1 .
        }
    
        FILTER EXISTS {
            ?node2 a :Node ;
                    :hasAnnotation ?ann2 .
            ?ann2 :hasReference ?ref2 .
            ?ref2 a :ReferenceType2 .
        }
    }