
How to write a SPARQL query that efficiently matches string literals while ignoring case


I am using Jena ARQ to run a SPARQL query against a large ontology stored in Jena TDB, in order to find the types associated with concepts based on their rdfs:label:

SELECT DISTINCT ?type WHERE {
 ?x <http://www.w3.org/2000/01/rdf-schema#label> "aspirin" .
 ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
}

This works pretty well and is actually quite speedy (<1 second). Unfortunately, for some terms, I need to perform this query in a case-insensitive way. For instance, because the label "Tylenol" is in the ontology, but not "tylenol", the following query comes up empty:

SELECT DISTINCT ?type WHERE {
 ?x <http://www.w3.org/2000/01/rdf-schema#label> "tylenol" .
 ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
}

I can write a case-insensitive version of this query using FILTER syntax like so:

SELECT DISTINCT ?type WHERE {
 ?x <http://www.w3.org/2000/01/rdf-schema#label> ?term .
 ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
 FILTER ( regex (str(?term), "tylenol", "i") )
}

But now the query takes over a minute to complete! Is there any way to write the case-insensitive query in a more efficient manner?


Solution

  • The FILTER query runs slower because ?term is unbound: the engine has to scan the PSO or POS index to find every statement with the rdfs:label predicate and then test each label against the regex. When the object was bound to a concrete literal (as in your first example), it could use an OPS or POS index to look up only the statements with the rdfs:label predicate and that specific object, which is a much smaller set.

    The common solution to this kind of text-search problem is to use an external text index. In this case, Jena provides a free-text index called LARQ, which uses Lucene to perform the search and joins the matches with the rest of the query.
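
    As a rough sketch of what that could look like, assuming a LARQ index has already been built over the label literals and registered as the default index (that setup is not shown here), the query might use LARQ's pf:textMatch property function. Lucene's standard analyzer lowercases terms, so the lookup should be case-insensitive if that analyzer is in use. Because Lucene matches on tokens, a label such as "Tylenol PM" would also come back, so the trailing FILTER (added here for illustration) narrows the small candidate set to whole-label matches:

    ```sparql
    PREFIX pf:   <http://jena.hpl.hp.com/ARQ/property#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    SELECT DISTINCT ?type WHERE {
      # Assumption: a LARQ (Lucene) index over the literals is the default index.
      # The text match binds ?term to the few literals that contain the token.
      ?term pf:textMatch "tylenol" .
      ?x rdfs:label ?term .
      ?x rdf:type ?type .
      # Optional: keep only labels equal to the search term, ignoring case.
      FILTER ( lcase(str(?term)) = "tylenol" )
    }
    ```

    The FILTER is cheap here because it only sees the handful of literals returned by the text index, rather than every rdfs:label in the store.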