sparql, rdf, jena, sesame

Appropriate repositories in Jena for property paths with arbitrary lengths


I am running SPARQL 1.1 queries that use property paths of arbitrary length. These queries run very efficiently in a Sesame Sail repository, but they run very slowly in Jena with either a Dataset (created from a Graph) or a Model (TDB).

Is there any option in Jena other than TDB or Graph?

Example: for a 60 MB N3 RDF file with about 600,000 triples, and the following query:

SELECT ?x ?y {
  ?x <http://relationship.com/wasRevisionOf>+ ?y .
  ?x <http://relationship.com/wasGeneratedBy>/<http://relationship.com/wasAssociatedWith> ?z1 .
  ?y <http://relationship.com/wasGeneratedBy>/<http://relationship.com/wasAssociatedWith> ?z2 .
  FILTER(?z1 = ?z2 && ?x = <http://article.com/524910968> && ?y = <http://article.com/524753791>)
} LIMIT 3

With Jena TDB this query takes 14 seconds, with a Jena Graph about 38 seconds, and with the Sesame Sail repository memory store only 100-150 ms.*

  • The 100-150 ms figure holds for every file size from 1 MB to 200 MB; the triples the query needs are included in all of the files.

Solution

  • While I do suggest that you try TDB to take advantage of disk-based indexing, there are a few things I would point out that could help any SPARQL engine do better. Your query has a filter with a few rather simple conditions. A possible issue is that, conceptually, a filter says to get all the possible results and then trim them down. For simple conditions, a good optimizer might recognize filters that can be applied during the query to prevent excess work.

    In this case, you're asking for ?z1 = ?z2 when you could simply use one variable instead of two. The other filter conditions just pin ?x and ?y to specific values, when you could simply use those values directly or supply them in a values block. Hopefully this won't make any difference, but do consider some rewrites along these lines:

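    # the ':' prefix is assumed to stand for the namespace used in the question
    prefix : <http://relationship.com/>

    select ?x ?y {
      values (?x ?y) { (<...> <...>) }
      ?z ^(:wasGeneratedBy/:wasAssociatedWith) ?x, ?y .
      ?x :wasRevisionOf+ ?y .
    }
    limit 3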
    

    Another thing that might help in general (but not necessarily in your case) is the order of evaluation. The search, as phrased, could naively be performed by starting at a ?z value, finding values for ?x and ?y, and only then checking whether there is a suitable path between ?x and ?y. But since ?x and ?y could match in either order, and the revision path might only go in one direction, it might make sense to look for suitable ?x ?y pairs first in a subquery, and then to find ?z values in an outer query. This probably doesn't matter in your case, though, since you're fixing the values of ?x and ?y from the beginning.
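
    The subquery idea could be sketched roughly as follows. This is only an illustration, not a tested rewrite: the ':' prefix is an assumption taken from the IRIs in the question, and ?x and ?y are left unbound here to show the general case. Whether an engine really evaluates the inner query first depends on its optimizer.

    ```sparql
    # assumption: ':' abbreviates the namespace used in the question
    prefix : <http://relationship.com/>

    select ?x ?y {
      # inner query: find the pairs connected by a revision path
      { select ?x ?y {
          ?x :wasRevisionOf+ ?y .
      } }
      # outer query: keep only the pairs that share a ?z
      ?x :wasGeneratedBy/:wasAssociatedWith ?z .
      ?y :wasGeneratedBy/:wasAssociatedWith ?z .
    }
    limit 3
    ```

    Because SPARQL subqueries are logically evaluated before the enclosing pattern, this form suggests to the engine that it should compute the revision pairs once and then look up ?z for each pair.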