Search code examples
rtreeigraph

Pruning / Retrieving Nodes from a Tree Based on Parent / Child / Sibling Relationships


I have a tree structure with uniquely identified nodes, a root, and a bunch of branches. There are no restrictions on how many branches any node can sprout or how deep they can go.

What I'd like to do is, given the id of a node, efficiently return the ids of all the parents of that node, as well as all the siblings of each of those parents. What I'm hoping is that the answer to this specific question will use packages/functions designed for this type of manipulation so that I may learn about them and use them for future similar problems rather than have to write custom algorithms every time.

Here is the tree in child-parent form:

structure(list(child = 1:20, parent = c(NA, 1L, 2L, 3L, 1L, 5L, 6L, 5L, 8L, 1L, 1L, 11L, 12L, 11L, 14L, 14L, 1L, 17L, 18L, 19L)), .Names = c("child", "parent"), class = "data.frame", row.names = c(NA, -20L))

And here is a visual representation:

enter image description here

For the specific problem my target id is 13 (in red), and would like to get the ids of the parent nodes (in orange), as well those of the siblings of the parents (in yellow). I can definitely come up with an algorithm to do this (getting all parents is easy, and then siblings means getting each parent's parent's children), but I am really hoping there already exists a framework to do all this efficiently.

I looked briefly into rpart, tree, and ape but as far as I can tell from my brief review they are targeted at statistical classification of trees rather than manipulation / retrieval of nodes, and it seemed to me based on a somewhat perfunctory review that they don't offer the functionality I'm after (entirely likely I'm just missing it / misinterpreting it).

The other option seems to be the XML or xmlTree that would allow me to use some of the DOM manipulation tools to achieve my objectives, but it seems a bit heavy handed and perhaps not optimal.


Solution

  • I've found myself in similar situations numerous times, and I've usually just rolled my own solution. I agree that the XML packages are overkill for focused problems such as this. The iGraph package is nice, but beware. It is an interface to a C library and is zero-indexed (unlike R which is 1-indexed). I've used it successfully, but there always seems to be a painful debugging process in which I keep finding "unshifted" indexes. (e.g. foo[index] has to change to foo[index-1].)

    Again, I'd probable just roll my own. Something like:

    foo <- structure(list(child = 1:20, parent = c(NA, 1L, 2L, 3L, 1L, 5L, 6L, 5L, 8L, 1L, 1L, 11L, 12L, 11L, 14L, 14L, 1L, 17L, 18L, 19L)), .Names = c("child", "parent"), class = "data.frame", row.names = c(NA, -20L))
    
    get_ancestry <- function(node, data=foo, ancestry=c()) {
        parent <- data$parent[data$child %in% node]
        if (is.na(parent)) {
            return(ancestry)
        } else {
            get_ancestry(parent, data, c(ancestry, parent))
        }
    }
    
    get_children <- function(nodes, data=foo) {
        unlist(lapply(nodes, function(x) {data$child[data$parent %in% x]}))
    }
    
    orange <- get_ancestry(13)
    yellow <- get_children(orange)