Computing custom histogram metrics to understand graph structure using SPARQL


I want to analyze the structure of a graph. One query I would like to try extracts the distinct combinations of subject type, edge type, and object type that occur in the graph.

This is a follow-up to a couple of my earlier questions:

How to generate all triples that fit a particular node type or/and edge type using SPARQL query?

How to list and count the different types of node and edge entities in the graph data using SPARQL query?

For example, suppose there is a semantic graph with edge types (property/predicate types) such as:

  1. IsCapitalOf
  2. IsCityOf
  3. HasPopulation, etc.

And suppose the node types are:

  1. Cities
  2. Countries
  3. Rivers
  4. Mountains, etc.

Then I should get:

  1. City->IsCapitalOf->Country 4 tuples
  2. City->IsCityOf->Country 21 tuples
  3. River->IsPartOf->Country 3 tuples
  4. River->PassesThrough->City 11 tuples

and so on...

Note: Triples with a literal in the object position should be excluded, since I only want the unit subgraph patterns of the form (subject type, edge type, object type).

To summarize, I think the way I'd approach this would be:

  a) Compute the distinct subject types in the graph
  b) Compute the distinct edge types in the graph
  c) Compute the distinct object types in the graph

(a, b and c have been answered in my previous questions.)

Now d) Generate all possible combinations of subject type -> edge type -> object type (no literals), together with the counts of such patterns (like a histogram).

I hope the question is articulated reasonably well.

Edit: Adding sample data (a few rows from the entire dataset). It is the YAGO dataset, which is publicly available.

<Alabama>   rdf:type    <wordnet_country_108544813> .
<Abraham_Lincoln>   rdf:type    <wordnet_president_110467179> .
<Aristotle> rdf:type    <wordnet_writer_110794014> .
<Academy_Award_for_Best_Art_Direction>  rdf:type    <wordnet_award_106696483> .
<Academy_Award> rdf:type    <wordnet_award_106696483> .
<Actrius>   rdf:type    <wordnet_movie_106613686> .
<Animalia_(book)>   rdf:type    <wordnet_book_106410904> .
<Ayn_Rand>  rdf:type    <wordnet_novelist_110363573> .
<Allan_Dwan>    rdf:type    <wikicategory_American_film_directors> .
<Algeria>   rdf:type    <wordnet_country_108544813> .
<Andre_Agassi>  rdf:type    <wordnet_player_110439851> .
<Austro-Asiatic_languages>  rdf:type    <wordnet_language_106282651> .
<Afroasiatic_languages> rdf:type    <wordnet_language_106282651> .
<Andorra>   rdf:type    <wordnet_country_108544813> .
<Animal_Farm>   rdf:type    <wordnet_novelette_106368962> .
<Alaska>    rdf:type    <wordnet_country_108544813> .
<Aldous_Huxley> rdf:type    <wordnet_writer_110794014> .
<Andrei_Tarkovsky>  rdf:type    <wordnet_film_maker_110088390> .

Solution

  • Suppose you've got data like this:

    @prefix : <http://stackoverflow.com/q/24313367/1281433/> .
    
    :City1 a :City .
    :City2 a :City .
    
    :Country1 a :Country .
    :Country2 a :Country .
    :Country3 a :Country .
    
    :River1 a :River .
    :River2 a :River .
    :River3 a :River .
    
    :City1 :isCapitalOf :Country1 .
    
    :River1 :isPartOf :Country1, :Country2 .
    :River2 :isPartOf :Country2, :Country3 .
    
    :River1 :passesThrough :City1, :City2 .
    :River2 :passesThrough :City2 .
    

    Then this query gives you the kind of results you want, I think:

    prefix : <http://stackoverflow.com/q/24313367/1281433/>
    
    select ?type1 ?p ?type2 (count(distinct *) as ?count) where {
       [ a ?type1 ; ?p [ a ?type2 ] ] 
    }
    group by ?type1 ?p ?type2 
    
    ----------------------------------------------
    | type1  | p              | type2    | count |
    ==============================================
    | :River | :passesThrough | :City    | 3     |
    | :City  | :isCapitalOf   | :Country | 1     |
    | :River | :isPartOf      | :Country | 4     |
    ----------------------------------------------
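
    To read the result as a histogram, it can help to sort the rows by count. The following is only a sketch of the same query under that assumption: the variable names ?s and ?o are introduced here for illustration and stand in for the two blank nodes, which makes it explicit that each (subject, object) pair is counted once per combination of types.

    ```sparql
    prefix : <http://stackoverflow.com/q/24313367/1281433/>

    # ?s and ?o are illustrative names replacing the two blank nodes.
    # With every node named, each match of the pattern is a distinct
    # solution, so a plain count(*) per group is sufficient.
    select ?type1 ?p ?type2 (count(*) as ?count) where {
      ?s a ?type1 ;      # the subject has some type
         ?p ?o .         # and is linked to an object by some property
      ?o a ?type2 .      # the object has some type as well
    }
    group by ?type1 ?p ?type2
    order by desc(?count)   # largest bucket first
    ```

    On the sample data above this should list :isPartOf (4) first, then :passesThrough (3), then :isCapitalOf (1). Note that a node with several rdf:type values contributes one row per type combination, which is likely to matter on a dataset like YAGO where entities often carry more than one type.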
    

    If you're not too comfortable with the [ … ] blank node syntax, it might help to see the expanded form:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    SELECT  ?type1 ?p ?type2 (count(distinct *) AS ?count)
    WHERE
      { _:b0 rdf:type ?type1 .
        _:b0 ?p _:b1 .
        _:b1 rdf:type ?type2
      }
    GROUP BY ?type1 ?p ?type2
    

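    This only catches things that have types, though. As a side effect, literal objects are never counted, since a literal cannot be the subject of an rdf:type triple. If you want to include things that don't have rdf:types, you'd want to do: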

    SELECT  ?type1 ?p ?type2 (count(distinct *) AS ?count) { 
        ?x ?p ?y
        optional { ?x a ?type1 }
        optional { ?y a ?type2 }
    }
    GROUP BY ?type1 ?p ?type2
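
    Because the optional version no longer requires the object to have a type, it also matches triples whose object is a literal, which the question asked to leave out. A possible variant, offered as a sketch rather than as part of the original answer, adds a filter on the object:

    ```sparql
    select ?type1 ?p ?type2 (count(*) as ?count) where {
      ?x ?p ?y .
      filter(!isLiteral(?y))       # keep only edges whose object is not a literal
      optional { ?x a ?type1 }     # ?type1 stays unbound for untyped subjects
      optional { ?y a ?type2 }     # ?type2 stays unbound for untyped objects
    }
    group by ?type1 ?p ?type2
    order by desc(?count)
    ```

    Rows with an unbound ?type1 or ?type2 then collect the edges whose subject or object has no rdf:type. The rdf:type triples themselves also match ?x ?p ?y, so they appear as rows with rdf:type in the ?p column unless they are filtered out as well.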