Search code examples
rdplyrtidyr

matching metadata on multiple nested data frames


I am pulling indicator data from an API which gives me back data frames containing nested data frames of varying sizes to hold additional metadata. I am trying to rectangle it, but without much luck.

Each indicator is 'tagged' with a number of groups (up to 15, out of a possible 100), both with group name and group id. Each group can be further categorised under a range of group sets.

I'm looking to transform my indicators data frame to add additional columns, with the names of the group sets, and the values being the corresponding group name for that indicator, or NA if not relevant.

The data comes from the API like this:

library(tibble)
indicators <- tibble::tibble(
  name = c("Number of lettuces", "Number of oranges"),
  groups = list(
    data.frame(
      name = c("Number"), 
      id = c("uid001")),
    data.frame(
      name = c("Oranges", "Citrus", "Number"),
      id = c("uid003", "uid004", "uid001")
    )
  )
)
indicator_groups <- tibble::tibble(
  name = c("Number", "Oranges", "Citrus"),
  id = c("uid001", "uid003", "uid004"),
  sets = list(
    data.frame(name = c("Unit_of_measure"), id = c("gid003")),
    data.frame(name = c("Fruit"), id = c("gid001")),
    data.frame(name = c("Fruit", "Type"), id = c("gid001", "gid002"))
  )
)
indicator_group_sets <- tibble::tibble(
  name = c("Fruit", "Type", "Unit_of_measure"),
  id = c("gid001", "gid002", "gid003"),
  groups = list(
    data.frame(id = c("uid003", "uid004")),
    data.frame(id = c("uid003", "uid004")),
    data.frame(id = c("uid001"))
  )
)

I'm sure some combination of dplyr::left_join() and/or tidyr::hoist() should be able to do it, but I'm struggling to understand how. I know I should only need one of indicator_groups or indicator_group_sets.

In previous attempts, I've managed to get to one column per group (but not per group set), and/or one row per indicator/group combination and lots of NAs which I haven't been able to reduce.

Desired result:

indicators
#>                 name   Fruit   Type Unit_of_measure
#> 1 Number of lettuces    <NA>   <NA>          Number
#> 2  Number of oranges Oranges Citrus          Number

Can anyone give me pointers to how to sensibly go about this please?


Solution

  • I presume there's something more elegant, but this seems to work. It's a little tricky since each of your tables has multiple fields called "name."

    library(tidyverse)
    indicators_tidy = indicators |>
      unnest_longer(groups) |>
      unnest_wider(groups, names_sep = "_")
    
    indicator_groups_tidy <- indicator_groups |>
      unnest_longer(sets) |>
      unnest_wider(sets, names_sep = "_")
    
    indicators_tidy |>
      left_join(indicator_groups_tidy,
                join_by(groups_name == name, groups_id == id))
    

    Result

    # A tibble: 5 × 5
      name               groups_name groups_id sets_name       sets_id
      <chr>              <chr>       <chr>     <chr>           <chr>  
    1 Number of lettuces Number      uid001    Unit_of_measure gid003 
    2 Number of oranges  Oranges     uid003    Fruit           gid001 
    3 Number of oranges  Citrus      uid004    Fruit           gid001 
    4 Number of oranges  Citrus      uid004    Type            gid002 
    5 Number of oranges  Number      uid001    Unit_of_measure gid003