Search code examples
javasparqljenalevenshtein-distancesimilarity

SPARQL: how to find similar strings?


I'm using Jena to query data stored in an ontology. Some of the objects are identified by a string, however sometimes the exact same string is not available, as I am processing scanned documents and so there may be OCR-Errors. Therefore, I'd like to find the most similar strings. Is there a way to use SPARQL for this purpose? Can I somehow calculate levenshtein distance in SPARQL?

If this is not possible, I can still calculate the levenshtein distance in java. However, an efficient algorithm would still require to filter out irrelevant strings using SPARQL.


Solution

  • SPARQL can't do this directly, but you could implement the levenshtein distance function in java, and use it in a SPARQL FILTER clause. Extensions in ARQ has details about using extension functions.