How to get the Wikidata labels for approximate terms?


I am using the query below to obtain the Wikidata label for a given term.

SELECT ?item WHERE {
  ?item rdfs:label "Word2vec"@en
}

The output is wd:Q22673982

However, when I spell Word2vec as word2vec (i.e. all characters in lowercase), the above query returns "No results".

Therefore, I would like to know whether there is a way to find out how the term is written in Wikidata and get its label.

That is, if I enter the term in all lowercase, how can I identify the equivalent Wikidata term and return its corresponding label?


Solution

  • If you're unsure of the precise spelling or capitalisation, you can use a filter function to perform the match. For example, to match regardless of capitalisation, you could use the LCASE() (or UCASE()) function, as follows:

    SELECT ?item WHERE {
      ?item rdfs:label ?label
      FILTER(LCASE(STR(?label)) = "word2vec")
    }
    

    This transforms any found label to lower-case and then compares it to the lower-case string.

    There's a whole host of different functions you can use for string manipulation; there's a good overview in the SPARQL 1.1 W3C Recommendation.
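
    As one illustration of those functions (a sketch, not part of the original answer's query): CONTAINS() matches the search term anywhere in the label, which can help when only part of the name is known. STRSTARTS() could be used the same way for prefix matches. This variant is unrestricted, so it is at least as expensive as the query above.

    SELECT ?item ?label WHERE {
      ?item rdfs:label ?label
      # Lower-case the label, then test for the term as a substring
      FILTER(CONTAINS(LCASE(STR(?label)), "word2vec"))
    }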

    NOTE: this kind of query is significantly more expensive (in terms of execution time), because the engine has to do a sequential scan over all possible matches. As @AKSW mentioned in the comments, the query as-is is likely to time out when you execute it on the Wikidata public endpoint. It would probably help a lot if you made the query more specific by adding additional triple patterns.

    Update: If you have a look at the information available for wd:Q22673982 (you can browse it at https://www.wikidata.org/wiki/Q22673982 ), you'll see that, among other things, it's a subclass of "word embedding" (wd:Q18395344). So instead of just asking for every ?item that has an rdfs:label, you could, for example, ask for all items that are a subclass of wd:Q18395344 and have this label, like this:

    SELECT DISTINCT ?item WHERE {
      ?item wdt:P279 wd:Q18395344;
            rdfs:label ?label
      FILTER(LCASE(STR(?label)) = "word2vec")
    }
    

    Unfortunately, Wikidata uses rather cryptic identifiers for its properties and relations. Suffice it to say that wdt:P279 corresponds to the "subclass of" relation. I added the DISTINCT because otherwise you get the same answer 10 or more times.
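
    Since the question also asks for the label as it is written in Wikidata, a possible variant is to select ?label as well. This is a sketch that assumes only the English label is wanted, hence the extra LANG() filter; the answer's own queries do not restrict the language.

    SELECT DISTINCT ?item ?label WHERE {
      ?item wdt:P279 wd:Q18395344;
            rdfs:label ?label
      # Assumption: keep only English labels
      FILTER(LANG(?label) = "en")
      # Case-insensitive comparison against the lower-cased input
      FILTER(LCASE(STR(?label)) = "word2vec")
    }

    The ?label column then holds the label with its original capitalisation, while the comparison itself is still done on the lower-cased form.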