r dataframe data-structures tree

Transform a dataframe into a multi-nested data frame


Good evening,

I am trying to write a function that takes an input data frame (example data using Supreme Court justices is provided below) and restructures it into a multi-generational nested data frame. The nested data is meant for use in an echarts4r tree visualization.

tree_data <- data.frame(
      name = c("John Jay", "John Rutledge", "William Cushing", "James Wilson", "John Blair", "James Iredell", "Thomas Johnson", "William Paterson", "Samuel Chase", "Oliver Ellsworth", "Bushrod Washington", "Alfred Moore", "John Marshall", "William Johnson", "John McLean", "Levi Woodbery", "William Strong"),
      big = c(NA, "John Jay", "John Jay", "William Cushing", "William Cushing", "John Rutledge", "John Rutledge", "John Rutledge", "John Jay", "Samuel Chase", "James Iredell", "William Cushing", "John Blair", "Bushrod Washington", "Bushrod Washington", "John McLean", "William Johnson"),
      pledgeClass = c("Founder","A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "D", "D", "E", "E"),
      alumniStatus = c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE))

The top-level data frame should hold the founder's information, including his name, plus a column that links to a sub-data frame containing all of his "littles" and their information. Each sub-data frame, in turn, holds those individuals' information and, for anyone who has a little, a reference to another sub-data frame; anyone without a little gets an NA instead. This repeats down the tree, with an NA wherever a branch ends.
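
To make the target shape concrete, here is a hand-built sketch of a single branch from the example data (John Jay, then Samuel Chase, then Oliver Ellsworth). It uses base R only, the `children` column name is an assumption borrowed from the answer below, and the real structure would also carry the other littles and the remaining columns.

```r
# One branch of the example data: John Jay -> Samuel Chase -> Oliver Ellsworth.

# Oliver Ellsworth has no littles, so his `children` entry is NA.
ellsworth <- data.frame(name = "Oliver Ellsworth", big = "Samuel Chase")
ellsworth$children <- list(NA)

# Samuel Chase's `children` entry is the data frame of his littles.
chase <- data.frame(name = "Samuel Chase", big = "John Jay")
chase$children <- list(ellsworth)

# The founder sits at the top; his `big` is NA.
founder <- data.frame(name = "John Jay", big = NA_character_)
founder$children <- list(chase)

# Walking down two generations reaches the end of the branch.
leaf <- founder$children[[1]]$children[[1]]
leaf$name
```

Each level is an ordinary data frame whose last column is a list, so one cell can hold a whole data frame for the next generation.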

The data is based on the concept of a fraternity/sorority family tree. Here is the best I've been able to do so far, with some assistance from ChatGPT:

create_nested_structure <- function(data) {
  nested_data <- data %>%
    group_by(big) %>%
    summarise(little = list(data.frame(little = unique(name))))

  nested_data$little <- map(nested_data$little, ~ .x)

  return(nested_data)
}

I'd greatly appreciate any assistance, and I appreciate you taking the time to read this post!

Update

Thanks Mark for the help already. Here is the modified code I used, per our comments below.

create_nested_structure <- function(data, big_boy) {
  data %>% 
    filter(big %in% big_boy) -> d
  if (nrow(d) == 0) {
    return(as.data.frame(NA))
  } else {
    return(mutate(d, children = map(name, ~create_nested_structure(data, .x))))
  }
}

To give you an idea of the output, here are two rows from one of the nested data frames, the first with an NA where a tibble would otherwise be:

Davie Jones,Jimmy Legs,D,TRUE,NA
Davie Jones, Will Turner, H, TRUE, list(big = "Will Turner", name = "Henry Turner", [...]

Because of the NA, the last column gets treated as a logical ("boolean") column, and wherever an NA is included the tibbles turn into lists. With the modified function, a member without littles gets a tibble that, when clicked on, opens to a one-row, one-column tibble whose only value is NA. That NA then shows up in the visualization, which is undesirable.


Solution

  • You can do this:

    library(tidyverse)
    
    create_nested_structure <- function(data, big_boy) {
      # Keep the rows whose `big` is big_boy, i.e. big_boy's direct littles.
      # %in% is used instead of == because NA == NA evaluates to NA,
      # whereas NA %in% NA evaluates to TRUE.
      d <- data %>%
        filter(big %in% big_boy)

      # If no rows are found (i.e. big_boy is the parent of no one), return NA.
      if (nrow(d) == 0) {
        return(NA)
      }

      # Otherwise, return the data frame with a `children` column built by
      # running the function recursively on each child (and so on, all the
      # way to the bottom).
      mutate(d, children = map(name, ~create_nested_structure(data, .x)))
    }

    # Start from the founder, whose `big` is NA.
    tree <- create_nested_structure(tree_data, NA)

    # Output:
    # # A tibble: 1 × 5
    #   name     big   pledgeClass alumniStatus children        
    #   <chr>    <chr> <chr>       <lgl>        <list>          
    # 1 John Jay NA    Founder     TRUE         <tibble [3 × 5]>
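
The choice of %in% over == in the filter step can be checked in isolation. This is a small base R sketch; the `big` vector is a shortened, illustrative stand-in for the `big` column of the example data.

```r
# Comparing with NA using == never yields TRUE.
NA == NA      # NA
NA %in% NA    # TRUE

# Shortened stand-in for the `big` column: the founder's big is NA.
big <- c(NA, "John Jay", "John Jay")

eq_result <- big == NA      # NA NA NA: no row can be selected this way
in_result <- big %in% NA    # TRUE FALSE FALSE: picks out the founder

# %in% also never returns NA for ordinary lookups.
big %in% "John Jay"         # FALSE TRUE TRUE
```

Since dplyr's filter() drops rows where the condition is NA, the == version would not find the founder's row when the function is called with NA.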
    

    I think the main problem with your code is that create_nested_structure never calls itself, so it has no way of recursing.
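
If the NA at the end of each branch causes trouble downstream, one variant worth trying is to return NULL for members without littles. This is only a sketch: `nest_littles` and `tree_subset` are hypothetical names, the data is a five-row subset of the example, and I have not verified how echarts4r renders the result. As far as I recall, the e_tree() example in the echarts4r documentation marks childless nodes with NULL, which is the reason for trying it.

```r
library(dplyr)
library(purrr)

# Five-row subset of the example data (illustrative only).
tree_subset <- data.frame(
  name = c("John Jay", "John Rutledge", "William Cushing",
           "Samuel Chase", "Oliver Ellsworth"),
  big  = c(NA, "John Jay", "John Jay", "John Jay", "Samuel Chase")
)

# Same recursion as above, but a member without littles gets NULL
# instead of NA (assumption: NULL suits the tree chart better).
nest_littles <- function(data, big_name) {
  littles <- filter(data, big %in% big_name)
  if (nrow(littles) == 0) {
    return(NULL)
  }
  mutate(littles, children = map(name, function(n) nest_littles(data, n)))
}

tree <- nest_littles(tree_subset, NA)

# John Jay's three littles, each with its own `children` entry.
tree$children[[1]]
```

With NULL, the `children` column stays a list of data frames and empty slots, so no one-cell NA data frame is created for the leaves.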


    Update: to create the tree:

    library(echarts4r)

    tree %>%
      e_charts() %>%
      e_tree() %>%
      e_title("Supreme Court Justices")
    
