DBpedia queries missing certain chemical compounds


I am running this query to get a list of all compounds from the public DBpedia SPARQL endpoint.

SELECT * WHERE {
  ?y rdf:type dbpedia-owl:Drug.
  ?y rdfs:label ?Name .
  OPTIONAL {?y dbpedia-owl:iupacName ?iupacname} .
  OPTIONAL {?y dcterms:subject ?y1}
  FILTER (langMatches(lang(?Name),"en"))
}
LIMIT 50000

I am downloading the results in batches of 50000 (2 files) using the OFFSET parameter.

Somehow Isopropyl_alcohol is not included in the results, even though its page exists at

and it has the properties that I am searching for?


Solution

  • There are two issues here. The first is that DBpedia Live and DBpedia do not have exactly the same content. According to the DBpedia Live webpage:

    Wikipedia users constantly revise Wikipedia articles with updates happening almost each second. Hence, data stored in the official DBpedia endpoint can quickly become outdated, and Wikipedia articles need to be re-extracted. DBpedia Live enables such a continuous synchronization between DBpedia and Wikipedia.

    That page also lists two SPARQL endpoints for DBpedia Live:

    However, you'll run into issues on both. Isopropyl_alcohol is in DBpedia, and its URI is

    Looking there, we see that Isopropyl alcohol doesn't have rdf:type dbpedia-owl:Drug, but only

    so you won't be able to find it with your query on DBpedia, because it doesn't have the type dbpedia-owl:Drug. Now, Isopropyl_alcohol also exists in DBpedia Live, and its URL is

    but it only has the following rdf:types:

    so it won't be found by your query on DBpedia Live, for the same reason.

    The second issue is the one that AndyS pointed out. Even if the query did select Isopropyl_alcohol in DBpedia or DBpedia Live, the limit/offset combination isn't guaranteed to return it unless you provide an ordering constraint. Without one, the server could legitimately return the same set of 50000 results to you every time.
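
    As a rough sketch of the ordering point, the second batch of your query could be requested like this. The ORDER BY clause and the explicit PREFIX declarations are my additions, not part of your original query. I'm assuming the standard namespace URIs for these prefixes, and I order by every selected variable because ordering by ?y alone would leave ties among rows that share the same resource.

    ```sparql
    # Prefixes are declared explicitly so the query does not depend on
    # the endpoint's predefined namespace table.
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>

    SELECT * WHERE {
      ?y rdf:type dbpedia-owl:Drug .
      ?y rdfs:label ?Name .
      OPTIONAL { ?y dbpedia-owl:iupacName ?iupacname }
      OPTIONAL { ?y dcterms:subject ?y1 }
      FILTER (langMatches(lang(?Name), "en"))
    }
    # Sorting on all selected variables keeps the row order stable
    # between requests, so consecutive pages should not overlap.
    ORDER BY ?y ?Name ?iupacname ?y1
    LIMIT 50000
    OFFSET 50000   # second batch; the first batch uses OFFSET 0
    ```

    This only addresses the paging problem. Isopropyl_alcohol still won't show up, because it doesn't have rdf:type dbpedia-owl:Drug in either dataset.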