Turning a data frame column to an array of dictionaries in YAML


I'm trying to generate the following YAML structure from tabular data:

- name: Josiah Carberry
  roles: 
    - investigation: lead 
    - data curation: supporting

I'm struggling with the structure of the roles key. It's basically an array of dictionaries, which would translate to a list of data frames in R.

My issue is that I can't figure out how to store such lists of data frames in a way that will produce the same output as in the example above.

Here's my attempt:

library(tibble)

tibble(
  id = paste0("id", 1:2),
  roles = list(
    list(tibble(writing = "lead"), tibble(supervision = "supporting")),
    list(tibble(writing = "equal"))
  )
) |> 
  jsonlite::toJSON() |> 
  jsonlite::parse_json() |> 
  yaml::as.yaml(indent.mapping.sequence = TRUE) |> 
  cat()

Which produces:

- id: id1
  roles:
    - - writing: lead
    - - supervision: supporting
- id: id2
  roles:
    - - writing: equal

As you can see, there's one extra dash before each role because of the outer list I use to store the data frames.

Any idea how I could get the following?

- id: id1
  roles:
    - writing: lead
    - supervision: supporting
- id: id2
  roles:
    - writing: equal

Solution

  • Option 1: regex

    I don't know of a way to make yaml::as.yaml produce this directly, but you can always use a simple gsub to replace every "- - " with "- " afterwards:

    tibble(
      id = paste0("id", 1:2),
      roles = list(
        list(tibble(writing = "lead"), tibble(supervision = "supporting")),
        list(tibble(writing = "equal"))
      )
    ) |> 
      jsonlite::toJSON() |> 
      jsonlite::parse_json() |> 
      yaml::as.yaml(indent.mapping.sequence = TRUE) |>
      gsub("- - ", "- ", x = _) |>
      cat()
    # - id: id1
    #   roles:
    #     - writing: lead
    #     - supervision: supporting
    # - id: id2
    #   roles:
    #     - writing: equal
    

    If there's a risk you could have a legitimate - - embedded within one of your items, you can make the gsub a bit more specific:

      gsub("((^|\n) *)- - ", "\\1- ", x = _)
    

    where

    • (^|\n) matches either the beginning of the string (^) or an embedded newline. This is necessary because as.yaml returns a single string with \n embedded within it, so we cannot rely on ^ alone to catch them all. We could instead split with strsplit(_, "\n")[[1]], apply gsub, and recombine, but that seems unnecessary given we can do it in one step
    • ((^|\n) *) finds the above plus zero or more spaces; I think " " is safe here instead of "\\s", since I believe as.yaml is always going to put out spaces there; because this is wrapped in parens, it will be available as a pattern group for later recall
    • "\\1- " replaces everything in the pattern (including the - -) with the parenthesized pattern group (previous bullet) and a single hyphen-space.
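
To make the difference between the two patterns concrete, here is a minimal base-R sketch. The yaml_text string is a made-up stand-in for as.yaml output (it is not produced by the pipeline above), with a hypothetical note value that contains a legitimate "- - ". The last variant is the strsplit route mentioned in the first bullet.

```r
# Hypothetical stand-in for as.yaml() output; the "note" value contains a
# "- - " that should be left alone.
yaml_text <- "- id: id1\n  roles:\n    - - writing: lead\n    - - note: a - - b\n"

# Simple pattern: rewrites every "- - ", including the one inside the value.
loose <- gsub("- - ", "- ", yaml_text)

# Anchored pattern: only rewrites a "- - " that follows the start of the
# string or a newline, plus optional indentation.
anchored <- gsub("((^|\n) *)- - ", "\\1- ", yaml_text)

# Line-by-line alternative: split on newlines, fix each line, then recombine
# (strsplit drops the trailing newline, so it is added back at the end).
lines <- strsplit(yaml_text, "\n")[[1]]
per_line <- paste0(paste(sub("^( *)- - ", "\\1- ", lines), collapse = "\n"), "\n")

cat(loose)
cat(anchored)
identical(anchored, per_line)
```

In this sketch, loose also rewrites the value to a - b, while anchored and per_line leave it as a - - b.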

  • Option 2: convert from tibble to list

    This avoids the - - problem up front, though it turns each inline dictionary value into a nested one-element sequence (writing: lead becomes writing: followed by - lead). When read back into R, both forms resolve to the same underlying structure, so if you're okay with the added verbosity then perhaps this is better:

    tibble(
      id = paste0("id", 1:2),
      roles = list(
        list(tibble(writing = "lead"), tibble(supervision = "supporting")),
        list(tibble(writing = "equal"))
      )
    ) |> 
      transform(roles = rapply(roles, unlist, how = "list")) |>
      jsonlite::toJSON() |> 
      jsonlite::parse_json() |> 
      yaml::as.yaml(indent.mapping.sequence = TRUE) |> 
      cat()
    # - id: id1
    #   roles:
    #     - writing:
    #         - lead
    #     - supervision:
    #         - supporting
    # - id: id2
    #   roles:
    #     - writing:
    #         - equal
    

    This looks slightly different from your expected output, though the entries are still valid YAML dictionaries. Reading both outputs back into R confirms that they are equivalent:

    out1 <- tibble(
      id = paste0("id", 1:2),
      roles = list(
        list(tibble(writing = "lead"), tibble(supervision = "supporting")),
        list(tibble(writing = "equal"))
      )
    ) |> 
      jsonlite::toJSON() |> 
      jsonlite::parse_json() |> 
      yaml::as.yaml(indent.mapping.sequence = TRUE) |>
      gsub("- - ", "- ", x = _)
    out2 <- tibble(
      id = paste0("id", 1:2),
      roles = list(
        list(tibble(writing = "lead"), tibble(supervision = "supporting")),
        list(tibble(writing = "equal"))
      )
    ) |> 
      transform(roles = rapply(roles, unlist, how = "list")) |>
      jsonlite::toJSON() |> 
      jsonlite::parse_json() |> 
      yaml::as.yaml(indent.mapping.sequence = TRUE)
    
    out1
    # [1] "- id: id1\n  roles:\n    - writing: lead\n    - supervision: supporting\n- id: id2\n  roles:\n    - writing: equal\n"
    out2
    # [1] "- id: id1\n  roles:\n    - writing:\n        - lead\n    - supervision:\n        - supporting\n- id: id2\n  roles:\n    - writing:\n        - equal\n"
    identical(yaml::read_yaml(text = out1), yaml::read_yaml(text = out2))
    # [1] TRUE
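
As a side note on why Option 2 prints writing: followed by - lead, here is a minimal base-R sketch of the rapply step in isolation. It uses data.frame in place of tibble, an assumption made only to keep the sketch free of dependencies. rapply appears to descend into each data frame as if it were a plain list, so each role ends up as a named list holding a length-one character vector, and jsonlite::toJSON writes length-one vectors as arrays unless auto_unbox = TRUE is set.

```r
# One id's worth of roles, stored as a list of one-row data frames
# (data.frame stands in for tibble here; that substitution is an assumption).
roles <- list(
  data.frame(writing = "lead", stringsAsFactors = FALSE),
  data.frame(supervision = "supporting", stringsAsFactors = FALSE)
)

# Same call as in Option 2, applied to a single roles entry.
flat <- rapply(roles, unlist, how = "list")

# Inspect the structure and check whether any element is still a data frame.
str(flat)
vapply(flat, is.data.frame, logical(1))
```

If the elements of flat are no longer data frames, they serialize as JSON objects rather than as arrays of rows, which would account for the missing extra dash.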