Search code examples
sparql

Can this SPARQL search query be made more efficient?


I have a compound 'search' query made in SPARQL that

(1) Searches for unique subject URIs that are of a certain rdf:type:

Example:

SELECT ?s FROM NAMED <http://www.example.org/graph1> FROM NAMED <http://www.example.org/graph2>
{
   GRAPH ?g
   {
      ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.example.org/widget>.
   }
} OFFSET 10000 LIMIT 100

This query is quite simple and just returns all subjects of type 'widget'.

(2) For the returned page of satisfying subject URIs, search for all subject URIs that have a reference to those subject URIs (i.e. referencing entities), specifying the reference predicate URIs that indicate a reference.

Let's say the previous query (1) returned 2 subject URIs http://www.example.org/widget100 and http://www.example.org/widget101 and the referencing predicate I wanted to query for was http://www.example.org/widget:

Example:

SELECT ?s FROM NAMED <http://www.example.org/graph1> FROM NAMED <http://www.example.org/graph2>
WHERE {
   UNION
   {
      ?s <http://www.example.org/widget> <http://www.example.org/widget100>
   }
   UNION
   {
      ?s <http://www.example.org/widget> <http://www.example.org/widget101>
   }
}

If the previous page returned 100 subject URIs, there would be 100 'UNION' statements here for each subject.

This query works - it selects the subject URIs of the given type, and returns the additional subject URIs that reference those subjects with the given reference predicate.

The problem is in practice, when I have 100,000s of triples across my query graphs, even on a fast machine on an in-memory graph this query is taking typically 1 minute+ to execute. This is unacceptably slow for users for this fairly typical search scenario.

Under profiling, both queries take roughly 50% of the query time.

I have enough experience with SPARQL to construct such a query above, but I am certainly not an expert. I am wondering if this could be made more efficient. For example, could it be combined into a single query that might at least reduce query times by 50%+? Is my use of UNIONs across potentially many subjects replacable by a more efficient method?

Thank you

SPARQL Guy

UPDATE: I have managed to reduce the query down to a single query of the following form:

SELECT  *
  FROM NAMED <http://www.example.org/widgets>
  FROM NAMED <http://www.example.org/widgetstats>
  FROM NAMED <http://www.example.org/widgetmetadata>
  FROM NAMED <http://www.example.org/widgetfactory>
  WHERE
    {   { SELECT  ?s ?p ?o
          WHERE
            { GRAPH ?g
                { ?s  ?p  ?o }
              { SELECT  ?s
                WHERE
                  { GRAPH ?i
                      { ?s  a  <http://www.example.org/widget> }
                  }
                OFFSET  0
                LIMIT   100
              }
            }
        }
      UNION
        { SELECT  ?s ?p ?o
          WHERE
            { GRAPH ?g
                { ?s  ?p  ?o }
              { SELECT DISTINCT  ?s
                WHERE
                  { GRAPH ?h
                      { OPTIONAL
                          { ?s  <http://www.example.org/widgetstats/widget>  ?x }
                        OPTIONAL
                          { ?s  <http://www.example.org/widgetmetadata/widget>  ?x }
                        OPTIONAL
                          { ?s  <http://www.example.org/widgetfactory/widget>  ?x }
                      }
                    { SELECT  ?x
                      WHERE
                        { GRAPH ?i
                            { ?x  a  <http://www.example.org/widget> }
                        }
                      OFFSET  0
                      LIMIT   100
                    }
                  }
              }
            }
        }
    }

This improves query speed by approx. 50%. The query can, though, I think be made faster. This form of query - fetching first all triples associated with the primary entities of the given type followed by all the triples associated with the referencing entities - requires two identical innermost subqueries, fetching the unique subjects of the given type.

Is there any way of reducing this query down - perhaps performing with a single query instead of a UNION of two subqueries? I am assuming this will probably improve performance further.

UPDATE 2: I couldn't improve on the query above (first update) and so I will make this as the answer for now.


Solution

  • If you still want the paging of the first query then probably the best approach would be to combine the queries using a SPARQL subquery.

    Note that with subqueries you work from the inside out, so the subquery selects the widgets and the outer query expands to find the references. If you are using FROM NAMED then you need to match on the graph (assuming your results are in a named graph and you aren't working with a union default graph). The OFFSET and LIMIT on the inner query means that the example below returns references to the 3rd widget (in whatever default sort order the engine is applying).

    I'm not sure if this will speed up the overall query time, but worth experimenting with and saves you a bunch of string concantenation!

    PREFIX ex: <http://www.example.org/>
    SELECT ?s FROM NAMED ex:g1 FROM NAMED ex:g2 WHERE {
      GRAPH ?h {
        ?s ex:widget ?x
      }
      {
        SELECT ?x WHERE {
          GRAPH ?g {
            ?x a ex:widget
          } 
        } OFFSET 2 LIMIT 1
      }
    }