Search code examples
treesparqlrdf

SPARQL query for collapsing branches of a tree (summarizing topology)


Let's say we have the same tree of this question

enter image description here

A, D, E and H are 'special' nodes belonging to the class :Special. Here is the tree definition

@prefix : <http://example.org#> .

:orgA :hasSuborganization :orgB, :orgC, :orgD.
:orgB :hasSuborganization :orgE, :orgF.
:orgE :hasSuborganization :orgG.
:orgG :hasSuborganization :orgH.
:orgA a :Special .
:orgD a :Special .
:orgE a :Special .
:orgH a :Special .

I'd like to get the same tree of the starting one but with only special nodes. A kind of summary of the starting topology. E.g., Expected output:

  A
 _|_
|   |
E   D
|
H

I'd like to get that with a SPARQL query. My starting point:

@prefix : <http://example.org#> .

select ?node ?node2 (count(?mid) as ?distance) where {
    ?node :hasSuborganization* ?mid .
    ?mid :hasSuborganization+ ?node2 .
    ?node2 a :Special .
    {
        select * where { 
            <http://example.org#orgA> :hasSuborganization* ?node .
            ?node a :Special .
        }
    }
} group by ?node ?node2

In this way I get the distance of each couple of special nodes in the tree.

How can I filter just the super-sub relations (i.e., A-D,A-E,E-H)? I believed that filtering the rows with the minimum values in my result-set was enough. Actually, it fails if one :Special node has :Special descendants at different height (e.g., distance(A-D) = 1, distance(A-E) = 2).

Probably I need something different.


Solution

  • Reasoning on AKSW clues in comments, a possible solution could be this one:

    @prefix : <http://example.org#> .
    
    select * where {
        ?node :hasSuborganization+ ?end .
        ?end  a :Special .
        FILTER NOT EXISTS {
            ?node :hasSuborganization+ ?mid .
            ?mid :hasSuborganization+ ?end .
            ?mid a :Special .
        }
        {
            select * where { 
                :orgA :hasSuborganization* ?node .
                ?node a :Special .
            }
        }
    }
    

    Explanation:

    The innermost query returns all :Special nodes from a root node (i.e., :orgA).

    select * where { 
        :orgA :hasSuborganization* ?node .
        ?node a :Special .
    }
    

    Then, the outer query selects all possible ?node :hasSuborganization+ ?end patterns. For example, for ?node = :orgA we get: A-D,A-E,E-H.

    Finally, the outer query filters out patterns with a :Special intermediate node (i.e., ?mid)

    FILTER NOT EXISTS {
        ?node :hasSuborganization+ ?mid .
        ?mid :hasSuborganization+ ?end .
        ?mid a :Special .
    }
    

    The final resultset is a collection of <?node, ?end> couples for building this summarized tree:

      A
     _|_
    |   |
    E   D
    |
    H
    

    This query works fine, even it does not scale very well when the tree becomes huge. Optimizations or different strategies could be possible.