Search code examples
sparqlsemantic-websimilaritydbpedialinked-data

To use iSPARQL to compare values using similarity measures


I have a question for you.

I would like to write a query that retrieves the values ​​that are similar (given a function of similarity, such as Lev) to a given string "Londn" to make the comparison with the predicate "RDFS:label" of DBPedia. In Output, for example, I would like to get the value of "London". I have read that a usable approach might be to use iSPARQL ("Imprecise SPARQL") although it is not very widely used in the literature.

Can I use iSPARQL or is there some SPARQL approach to perform the same operations?


Solution

  • Short Version — You can do some of this in pure SPARQL

    You can use a query like this to find cities whose names are like "Londn", and order them by (one measure of) similarity. The rest of the answer explains how this works:

    select ?city ?percent where {
      ?city a dbpedia-owl:City ;
            rdfs:label ?label .
      filter langMatches( lang(?label), 'en' )
    
      bind( replace( concat( 'x', str(?label) ), "^x[^Londn]*([L]?)[^ondn]*([o]?)[^ndn]*([n]?)[^dn]*([d]?)[^n]*([n]?).*$", '$1$2$3$4$5' ) as ?match )
      bind( xsd:float(strlen(?match))/strlen(str(?label)) as ?percent )
    }
    order by desc(?percent)
    limit 100
    

    SPARQL results

    city                                  percent
    ----------------------------------------------
    http://dbpedia.org/resource/London    0.833333
    http://dbpedia.org/resource/Bonn      0.75
    http://dbpedia.org/resource/Loudi     0.6
    http://dbpedia.org/resource/Ladnu     0.6
    http://dbpedia.org/resource/Lonar     0.6
    http://dbpedia.org/resource/Longnan   0.571429
    http://dbpedia.org/resource/Longyan   0.571429
    http://dbpedia.org/resource/Luoding   0.571429
    http://dbpedia.org/resource/Lodhran   0.571429
    http://dbpedia.org/resource/Lom%C3%A9 0.5
    http://dbpedia.org/resource/Andong    0.5
    

    Computing string similarity metrics

    Note: the code in this part of this answer works in Apache Jena. There's actually an edge case that causes this to (correctly) fail in Virtuoso. The update at the end addresses this issue.

    There's nothing built into SPARQL for computing string matching distances, but you can do some of it using the regular expression replacement mechanism in SPARQL. Suppose you want to match the sequence "cat" in some strings. Then you can use a query like this to figure out how much of a given string is present in the sequence "cat":

    select ?string ?match where {
      values ?string { "cart" "concatenate" "hat" "pot" "hop" }
      bind( replace( ?string, "^[^cat]*([c]?)[^at]*([a]?)[^t]*([t]?).*$", "$1$2$3" ) as ?match )
    }
    
    -------------------------
    | string        | match |
    =========================
    | "cart"        | "cat" |
    | "concatenate" | "cat" |
    | "hat"         | "at"  |
    | "pot"         | "t"   |
    | "hop"         | ""    |
    -------------------------
    

    By examining the length of string and match, you should be able to compute some various similarity metrics. As a more complicated example that uses the "Londn" input you mentioned. The percent column is the percent of the string that matched the input.

    select ?input
           ?string
           (strlen(?match)/strlen(?string) as ?percent)
    where {
      values ?string { "London" "Londn" "London Fog" "Lando" "Land Ho!"
                       "concatenate" "catnap" "hat" "cat" "chat" "chart" "port" "part" }
    
      values (?input ?pattern ?replacement) {
        ("cat"   "^[^cat]*([c]?)[^at]*([a]?)[^t]*([t]?).*$"                              "$1$2$3")
        ("Londn" "^[^Londn]*([L]?)[^ondn]*([o]?)[^ndn]*([n]?)[^dn]*([d]?)[^n]*([n]?).*$" "$1$2$3$4$5")
      }
    
      bind( replace( ?string, ?pattern, ?replacement) as ?match )
    }
    order by ?pattern desc(?percent)
    
    --------------------------------------------------------
    | input   | string        | percent                    |
    ========================================================
    | "Londn" | "Londn"       | 1.0                        |
    | "Londn" | "London"      | 0.833333333333333333333333 |
    | "Londn" | "Lando"       | 0.6                        |
    | "Londn" | "London Fog"  | 0.5                        |
    | "Londn" | "Land Ho!"    | 0.375                      |
    | "Londn" | "concatenate" | 0.272727272727272727272727 |
    | "Londn" | "port"        | 0.25                       |
    | "Londn" | "catnap"      | 0.166666666666666666666666 |
    | "Londn" | "cat"         | 0.0                        |
    | "Londn" | "chart"       | 0.0                        |
    | "Londn" | "chat"        | 0.0                        |
    | "Londn" | "hat"         | 0.0                        |
    | "Londn" | "part"        | 0.0                        |
    | "cat"   | "cat"         | 1.0                        |
    | "cat"   | "chat"        | 0.75                       |
    | "cat"   | "hat"         | 0.666666666666666666666666 |
    | "cat"   | "chart"       | 0.6                        |
    | "cat"   | "part"        | 0.5                        |
    | "cat"   | "catnap"      | 0.5                        |
    | "cat"   | "concatenate" | 0.272727272727272727272727 |
    | "cat"   | "port"        | 0.25                       |
    | "cat"   | "Lando"       | 0.2                        |
    | "cat"   | "Land Ho!"    | 0.125                      |
    | "cat"   | "Londn"       | 0.0                        |
    | "cat"   | "London"      | 0.0                        |
    | "cat"   | "London Fog"  | 0.0                        |
    --------------------------------------------------------
    

    Update

    The code above works in Apache Jena, but fails in Virtuoso, due to the fact that the pattern can match the empty string. E.g., if you try the following query on DBpedia's endpoint (which is powered by Virtuoso), you'll get the following error:

    select (replace( "foo", ".*", "x" ) as ?bar) where {}
    

    Virtuoso 22023 Error The regex-based XPATH/XQuery/SPARQL replace() function can not search for a pattern that can be found even in an empty string

    This surprised me, but the spec for replace says that it's based on XPath fn:replace. The docs for fn:replace say:

    An error is raised [err:FORX0003] if the pattern matches a zero-length string, that is, if the expression fn:matches("", $pattern, $flags) returns true. It is not an error, however, if a captured substring is zero-length.

    We can get around this problem, though, by adding a character to the beginning of both the pattern and the string:

    select ?input
           ?string
           (strlen(?match)/strlen(?string) as ?percent)
    where {
      values ?string { "London" "Londn" "London Fog" "Lando" "Land Ho!"
                       "concatenate" "catnap" "hat" "cat" "chat" "chart" "port" "part" }
    
      values (?input ?pattern ?replacement) {
        ("cat"   "^x[^cat]*([c]?)[^at]*([a]?)[^t]*([t]?).*$"                              "$1$2$3")
        ("Londn" "^x[^Londn]*([L]?)[^ondn]*([o]?)[^ndn]*([n]?)[^dn]*([d]?)[^n]*([n]?).*$" "$1$2$3$4$5")
      }
    
      bind( replace( concat('x',?string), ?pattern, ?replacement) as ?match )
    }
    order by ?pattern desc(?percent)
    
    --------------------------------------------------------
    | input   | string        | percent                    |
    ========================================================
    | "Londn" | "Londn"       | 1.0                        |
    | "Londn" | "London"      | 0.833333333333333333333333 |
    | "Londn" | "Lando"       | 0.6                        |
    | "Londn" | "London Fog"  | 0.5                        |
    | "Londn" | "Land Ho!"    | 0.375                      |
    | "Londn" | "concatenate" | 0.272727272727272727272727 |
    | "Londn" | "port"        | 0.25                       |
    | "Londn" | "catnap"      | 0.166666666666666666666666 |
    | "Londn" | "cat"         | 0.0                        |
    | "Londn" | "chart"       | 0.0                        |
    | "Londn" | "chat"        | 0.0                        |
    | "Londn" | "hat"         | 0.0                        |
    | "Londn" | "part"        | 0.0                        |
    | "cat"   | "cat"         | 1.0                        |
    | "cat"   | "chat"        | 0.75                       |
    | "cat"   | "hat"         | 0.666666666666666666666666 |
    | "cat"   | "chart"       | 0.6                        |
    | "cat"   | "part"        | 0.5                        |
    | "cat"   | "catnap"      | 0.5                        |
    | "cat"   | "concatenate" | 0.272727272727272727272727 |
    | "cat"   | "port"        | 0.25                       |
    | "cat"   | "Lando"       | 0.2                        |
    | "cat"   | "Land Ho!"    | 0.125                      |
    | "cat"   | "Londn"       | 0.0                        |
    | "cat"   | "London"      | 0.0                        |
    | "cat"   | "London Fog"  | 0.0                        |
    --------------------------------------------------------