
Getting a JSON/dictionary of all DBpedia properties for a page/resource from the Wikipedia infobox


I'm trying to get a representation of the infobox of Wikipedia articles in a Python project. I tried using the Wikipedia API, but the data it outputs is dirty, so I'm trying to move to DBpedia. I need to be able to query by page name and receive a dictionary of the property names and their values for that page. For example, a query for London would return a dictionary containing:

{dbpedia-owl:PopulatedPlace/areaMetro : 8382.0,
 dbpedia-owl:PopulatedPlace/areaTotal : 1572.0,
 .....
 dbpedia-owl:populationDensity : 5285.0
 .....
}

etc., and from this I would be able to read all the keys that were in the Infobox. I did try using the SPARQL query of

describe <http://dbpedia.org/resource/London>

but that returned tonnes of unnecessary data (the full set of triples associated with London), which is many orders of magnitude more than I need.

How can I write a query to just get the infobox properties, as above?


Solution

  • You might be able to get what you want by selecting properties and objects where the property IRI begins with a namespace you're interested in (e.g., http://dbpedia.org/ontology/). You could use a query like the following. (It takes advantage of the fact that a prefix by itself, e.g., dbpedia-owl:, is still a legal prefixed name for the namespace IRI, so you can use str on it. You could also just use the string "http://dbpedia.org/ontology/".)

    select ?p ?o where {
      dbpedia:London ?p ?o
      filter strstarts(str(?p),str(dbpedia-owl:))
    }
    


    The JSON results aren't quite in the format you're looking for, but are like this:

    { "head": { "link": [], "vars": ["p", "o"] },
      "results": { "distinct": false, "ordered": true, "bindings": [
        { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" }   , "o": { "type": "uri", "value": "http://mapoflondon.uvic.ca/" }},
        { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" }   , "o": { "type": "uri", "value": "http://www.british-history.ac.uk/place.aspx?region=1" }},
        { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" }   , "o": { "type": "uri", "value": "http://www.london.gov.uk/" }},
        { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" }   , "o": { "type": "uri", "value": "http://www.museumoflondon.org.uk/" }},
        { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" }   , "o": { "type": "uri", "value": "http://www.tfl.gov.uk/" }},
        { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" }   , "o": { "type": "uri", "value": "http://www.visitlondon.com/" }},
        { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" }   , "o": { "type": "uri", "value": "https://london.gov.uk/" }},
        { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" }   , "o": { "type": "uri", "value": "http://www.britishpathe.com/workspace.php?id=2449&delete_record=75105/" }},
        { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/thumbnail" }  , "o": { "type": "uri", "value": "http://commons.wikimedia.org/wiki/Special:FilePath/Greater_London_collage_2013.png?width=300" }},
    ...
    

    That makes sense, though: a property doesn't necessarily have a unique value, so a Python dict as in the question probably isn't the best result format (but it would be easy to create one in which each property maps to a list of its values).
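As a minimal sketch of that dict-of-lists idea, the bindings can be grouped by property IRI. The function name `bindings_to_dict` and the small `sample` document are illustrative assumptions (the sample reuses a few rows from the output above); the values are kept as the strings that appear in the JSON, with no datatype conversion.

```python
from collections import defaultdict


def bindings_to_dict(sparql_json):
    """Group SPARQL JSON results for ?p ?o into {property IRI: [values]}."""
    grouped = defaultdict(list)
    for row in sparql_json["results"]["bindings"]:
        # Each binding holds one property ("p") and one object ("o").
        grouped[row["p"]["value"]].append(row["o"]["value"])
    return dict(grouped)


# Hypothetical sample in the same shape as the endpoint's JSON output.
sample = {
    "head": {"link": [], "vars": ["p", "o"]},
    "results": {"distinct": False, "ordered": True, "bindings": [
        {"p": {"type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink"},
         "o": {"type": "uri", "value": "http://www.london.gov.uk/"}},
        {"p": {"type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink"},
         "o": {"type": "uri", "value": "http://www.tfl.gov.uk/"}},
        {"p": {"type": "uri", "value": "http://dbpedia.org/ontology/thumbnail"},
         "o": {"type": "uri", "value": "http://commons.wikimedia.org/wiki/Special:FilePath/Greater_London_collage_2013.png?width=300"}},
    ]},
}

infobox = bindings_to_dict(sample)
for prop, values in infobox.items():
    print(prop, values)
```

Every property maps to a list, even when it has a single value, so the calling code does not need to special-case multi-valued properties.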

    Also note that the properties beginning with dbpedia-owl: are the DBpedia Ontology properties, which have much cleaner data than the raw infobox values; the raw values use properties beginning with dbpprop:. You can read more about the different datasets at 4.3. Infobox Data. A query for the raw properties would be pretty much the same, though:

    select ?p ?o where {
      dbpedia:London ?p ?o
      filter strstarts(str(?p),str(dbpprop:))
    }
    

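Since the question asks to query by page name from Python, here is a rough sketch of how the queries above could be built and sent using only the standard library. Several things here are assumptions rather than part of the answer: the endpoint URL https://dbpedia.org/sparql, the `format` request parameter, the helper names `build_query`, `build_url`, and `fetch`, and the simple rule that a page name maps to a resource IRI by replacing spaces with underscores (names with other special characters would need more careful encoding). The full namespace string is used in the filter instead of a prefix, so the query does not depend on the endpoint's predefined prefixes.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"


def build_query(page_name, prefix="http://dbpedia.org/ontology/"):
    """Build the select query for one page, filtering properties by namespace."""
    # Assumption: the resource IRI is the page name with spaces as underscores.
    resource = "http://dbpedia.org/resource/" + page_name.replace(" ", "_")
    return (
        "select ?p ?o where {\n"
        f"  <{resource}> ?p ?o\n"
        f'  filter strstarts(str(?p), "{prefix}")\n'
        "}"
    )


def build_url(page_name, prefix="http://dbpedia.org/ontology/"):
    """Return a GET URL that asks the endpoint for SPARQL JSON results."""
    params = {
        "query": build_query(page_name, prefix),
        "format": "application/sparql-results+json",
    }
    return ENDPOINT + "?" + urlencode(params)


def fetch(page_name, prefix="http://dbpedia.org/ontology/"):
    """Run the query over the network and return the parsed JSON document."""
    with urlopen(build_url(page_name, prefix)) as response:
        return json.load(response)


print(build_query("London"))
```

Calling `fetch("London")` would return JSON in the shape shown earlier; passing `prefix="http://dbpedia.org/property/"` (the namespace behind dbpprop:) would select the raw infobox properties instead.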