Search code examples
sparqlrdfamazon-neptune

Selecting connected vertices on a dependency tree and ordering them by depth (SPARQL)


I’m using SPARQL for RDF and am having trouble coming up with a query that will allow me to select a set of vertices on a tree, and have all connected (direct and transitive) vertices to those selected be returned, and in the order of depth that they exist on the dependency tree (deepest vertices first).

Here is a visual representation of the graph (tree) I’m working with, which contains nodes of RDF type :dependency. In this tree, dependencies have outgoing edges to each dependency that they require, and this is represented on the graph using a :requires edge.

enter image description here

Here is the data that creates this graph:

INSERT DATA {
   :A             rdf:type   :dependency .
   :B             rdf:type   :dependency .
   :C             rdf:type   :dependency .
   :D             rdf:type   :dependency .
   :E             rdf:type   :dependency .
   :F             rdf:type   :dependency .
   :G             rdf:type   :dependency .
   :H             rdf:type   :dependency .
   :I             rdf:type   :dependency .
   :J             rdf:type   :dependency .
   :K             rdf:type   :dependency .

   :A             :requires  :B .
   :A             :requires  :C .
   :A             :requires  :I .
   :B             :requires  :J .
   :B             :requires  :E .
   :C             :requires  :D .
   :E             :requires  :F .
   :E             :requires  :G .
   :D             :requires  :G . 
   :H             :requires  :I .     
   :H             :requires  :K .
}

I would like to form a query that will allow me to select a set of dependencies, and return all dependencies of that selection in an order in which the lowest :required dependencies are returned first, such that if "Dependency X" :requires "Dependency Y", then Dependency X must come after Dependency Y in the result set.

Essentially, I want to ask the graph:

"Give me all vertices that are required by this set of vertices, and return them in the order so that the lowest level dependencies are prioritized first on the result set"

For example, in the case of specifying this selection on the above graph: [B, H]

A valid result set would be: [G, F, E, J, B, I, K, H]

(Since G and F are the lowest dependencies of B, followed by E, J and so on…)

The following result set would be also considered valid: [J, F, G, E, B, I, K, H]

since it does not matter whether J is returned before or after F, G, and H as it doesn't depend on them (all that matters is it is returned before B, since B requires all of them)

So far, I have attempted this query for the selection [B, H], which is able to return all required dependencies of B and H, but they are not returned in the order of lowest dependency priority.

SELECT ?requires {
    ?dependency a :dependency .
    FILTER (?dependency IN (:B, :H))
    ?dependency :requires+ ?requires .
}

which returns the result set in the incorrect order: [E, F, G, J, I, K]

Anyone have any ideas how I could construct a query that essentially performs a depth-first search ordering of the vertices?


Solution

  • Thanks to this answer shared by @UninformedUser, I can get a result set that is close enough to what I’m looking for. Basically we can use property paths to determine all :requires edges that are underneath a set of specified vertices, while keeping a counter that will return the depth of each of those edges:

    If we want to determine the order for this selection [B, H], then we can use this query, which is a slightly modified version of the query in the above answer, to allow for the selection of the beginning vertices instead of selecting all paths from the top:

    select ?begin ?midI ?midJ (count(?counter) as ?position) ?end where {
      ?begin a :dependency .
      FILTER (?begin IN (:B, :H))
      ?begin :requires* ?counter .
      ?counter :requires* ?midI .
    
      ?midI :requires ?midJ .
    
      ?midJ :requires* ?end .
      FILTER NOT EXISTS { ?end :requires [] }
    }
    group by ?begin ?end ?midI ?midJ 
    order by DESC(?position) ?begin ?end 
    

    This returns the following result set:

    ----------------------------------------
    | begin | midI | midJ | position | end |
    ========================================
    | :B    | :E   | :F   | 2        | :F  |
    | :B    | :E   | :G   | 2        | :G  |
    | :B    | :B   | :E   | 1        | :F  |
    | :B    | :B   | :E   | 1        | :G  |
    | :B    | :B   | :J   | 1        | :J  |
    | :H    | :H   | :I   | 1        | :I  |
    | :H    | :H   | :K   | 1        | :K  |
    ----------------------------------------
    

    This is sufficient to allow me to perform the remainder of the sorting with code on the client side, where for each unique value in the begin column, collect all values of midJ (starting from top to bottom), until all rows of the same begin value are exhausted, then finally include the value in the begin column. When doing that for the above result set, we get a valid answer: [F, G, E, B, J, I, K, H]