sparql wikipedia dbpedia sparqlwrapper

How to extract all articles in subcategories recursively using DBpedia?


I need to extract information about articles (e.g., abstract, thumbnail) that are located in the nested subcategories of a given category (e.g., History). How can I do that with a SPARQL query? Or what is the best way to do it in Python with a few SPARQL subqueries?


Solution

  • This finds every subcategory ?sc that is recursively (transitively) narrower than "History", down to a depth of 3. I implemented it with the {minDepth,maxDepth} path notation that Virtuoso understands; it is not part of standard SPARQL 1.1, so other triplestores may reject it. I also filter string literals to English while still keeping triples whose ?o is an IRI.

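    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?sc ?lab ?p ?o
    WHERE {
      # {1,3} is Virtuoso-specific: follow skos:broader one to three hops.
      ?sc skos:broader{1,3} <http://dbpedia.org/resource/Category:History> .
      # Keep the language test inside OPTIONAL so categories without an
      # English label are not dropped.
      OPTIONAL { ?sc rdfs:label ?lab . FILTER (lang(?lab) = "en") }
      ?sc ?p ?o .
      # Keep English literals and IRIs for ?o.
      FILTER ((lang(?o) = "en") || isURI(?o))
    }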
    

    That query also returns every triple with ?sc as the subject. Among those I didn't see any abstracts (assuming <http://dbpedia.org/ontology/abstract> is the right predicate) or any thumbnail relationships. You can confirm that by projecting only the distinct ?p values, or by counting them:

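    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?p (COUNT(?p) AS ?pcount)
    WHERE {
      ?sc skos:broader{1,3} <http://dbpedia.org/resource/Category:History> .
      # Language test kept inside OPTIONAL so unlabeled categories still count.
      OPTIONAL { ?sc rdfs:label ?lab . FILTER (lang(?lab) = "en") }
      ?sc ?p ?o .
      FILTER ((lang(?o) = "en") || isURI(?o))
    }
    GROUP BY ?p
    ORDER BY DESC(?pcount)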
    

    With deeper recursion you do find some abstracts, but deep recursion is slow, and I suspect I'm missing something conceptually.

    SELECT *
    WHERE {
      ?sc skos:broader{5,7} <http://dbpedia.org/resource/Category:History> .
      ?sc <http://dbpedia.org/ontology/abstract> ?a 
    }
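
One possible reading of the missing piece: the queries above inspect the category resources themselves, whereas DBpedia usually attaches an article to its categories through dct:subject (http://purl.org/dc/terms/subject), so abstracts and thumbnails would be expected on the articles, not on ?sc. The Python sketch below illustrates that idea under stated assumptions and is not a tested solution: the helper names build_article_query and fetch_articles are hypothetical, the dbo:thumbnail predicate and the https://dbpedia.org/sparql endpoint are assumed, and the {1,N} path syntax is the same Virtuoso-specific notation used above. SPARQLWrapper is imported only inside fetch_articles, so the query builder can be checked without a network connection.

```python
"""Sketch: abstracts and thumbnails for articles under a DBpedia category tree."""

PREFIXES = """\
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX dbo:  <http://dbpedia.org/ontology/>
"""

CATEGORY_NS = "http://dbpedia.org/resource/Category:"


def build_article_query(category, max_depth=3, lang="en", limit=100):
    """Return a SPARQL query string (hypothetical helper, not a library API).

    `category` is the name after "Category:", e.g. "History"; DBpedia uses
    underscores where Wikipedia titles contain spaces.
    """
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    root = f"<{CATEGORY_NS}{category}>"
    return PREFIXES + f"""
SELECT DISTINCT ?article ?abstract ?thumbnail
WHERE {{
  {{
    # Articles filed directly under the root category.
    ?article dct:subject {root} .
  }}
  UNION
  {{
    # Articles filed under a subcategory up to max_depth levels down.
    # The {{1,N}} form is Virtuoso-specific, as in the queries above.
    ?sc skos:broader{{1,{max_depth}}} {root} .
    ?article dct:subject ?sc .
  }}
  ?article dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "{lang}")
  # Assumption: thumbnails, when present, use dbo:thumbnail.
  OPTIONAL {{ ?article dbo:thumbnail ?thumbnail }}
}}
LIMIT {limit}
"""


def fetch_articles(category, endpoint="https://dbpedia.org/sparql", **kwargs):
    """Run the query; needs the third-party SPARQLWrapper package and network access."""
    from SPARQLWrapper import SPARQLWrapper, JSON  # imported lazily on purpose

    client = SPARQLWrapper(endpoint)
    client.setQuery(build_article_query(category, **kwargs))
    client.setReturnFormat(JSON)
    bindings = client.query().convert()["results"]["bindings"]
    # Flatten each result row to {variable name: plain value}.
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in bindings
    ]


if __name__ == "__main__":
    # Print the generated query only; no request is sent here.
    print(build_article_query("History", max_depth=2, limit=10))
```

Calling fetch_articles("History", max_depth=2, limit=10) should then give a list of dicts keyed by article, abstract and, where one exists, thumbnail. The LIMIT is kept deliberately, since deep category trees can be slow, as noted above.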