When using SPARQL to query RDF dataset, the same query can be written in many different ways. For example, sparql queries are always permutation-invariant with respect to some clauses inside it. Also, we can rename the variables inside a sparql query. But how can we identify those identical SPARQL queries? Ideally, there should be a python package that can parse a sparql query (i.e., a string object) into a query object, and different strings share the same underlying query are parsed into the same object, then we can simply compare the parsed query objects to determine whether two sparql queries are identical. Is there any tool like this (seems prepareQuery()
in rdflib
doesn't work in this way)? If not, then what should I do?
Semantically identical queries example:
SELECT ?x WHERE { ?x foaf:haha ?k .\n ?person foaf:knows ?x .}
SELECT ?s WHERE { ?person foaf:knows ?s .\n ?s foaf:haha ?k .}
The paper "Generating SPARQL Query Containment Benchmarks using the SQCFramework" by Muhammad Seleem et al., mentions "SPARQL query containment solvers" where
Query containment is the problem of deciding if the result set of a query Q1 is included in the result set of another query Q2
If you use such a solver to test whether the result set of Q1 is a subset of Q2 and vice versa, you have established that they are semantically identical.
As for your "off-the-shelf tool": the former paper mentions that those are tested in another paper "Evaluating and benchmarking sparql query containment solvers." by M.W. Chekol et al..
As for the complexity and computability, the latter paper mentions:
The query containment problem for full SPARQL is undecidable [15, 1]. Hence, it is necessary to reduce SPARQL in order to consider it. A double exponential upper bound has been proven for the containment and equivalence problems of SPARQL queries without OPTIONAL , FILTER and under set semantics [7].
However, query containment in both directions is only one way to determine identity of queries. I am unaware whether there is a proof of a better complexity/computability for query identity than for query containment (or a proof on the contrary).