Specify group_by variables programmatically in a targets workflow with static branching


I'm using the targets package in R to organize data processing and analysis for some studies that involve multiple different data sources (surveys, performance metrics on different tasks, etc). Because some of the tasks produce large quantities of data, I have to use dynamic branching to limit how much data I'm trying to load into R at any given time.

I've started using static branching as well to better organize the workflow, but I'm running into an issue, and I'm hoping there is a work-around. I would like to be able to define the variables I want to group by in tar_group_by or tar_group_select in my list/tibble of parameters to send to tar_map, as seen in the following reprex:

library(targets)

targets::tar_dir({
  targets::tar_script({
    library(targets)
    library(tarchetypes)
    library(tidyverse)
    
    df1 <- data.frame(
      g1 = rep(LETTERS[1:3], 2),
      x = 1:6,
      y = 1:6
    ) |>
      group_by(g1)
    
    df2 <- data.frame(
      g1 = rep(LETTERS[1:3], 2),
      g2 = rep(LETTERS[4:6], 2),
      x = 1:6,
      y = 1:6
    ) |>
      group_by(g2)
    
    list(
      label = c("group1", "group2"),
      source = c("df1", "df2") |> syms(),
      by = c("g1", "g2")  # group by g1 for df1 and g2 for df2
    ) |>
      tar_map(
        names = "label",
        tar_group_select(
          result,
          identity(source),
          # by = 'g1' # works (because g1 is defined in both data sets)
          by = by
        )
      )
  })
  
  tar_manifest()
  tar_make()
})
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
#> ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
#> ✔ tibble  3.1.7     ✔ dplyr   1.0.9
#> ✔ tidyr   1.2.0     ✔ stringr 1.4.0
#> ✔ readr   2.1.2     ✔ forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
#> ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
#> ✔ tibble  3.1.7     ✔ dplyr   1.0.9
#> ✔ tidyr   1.2.0     ✔ stringr 1.4.0
#> ✔ readr   2.1.2     ✔ forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> • start target result_group1
#> ✖ error target result_group1
#> • end pipeline: 0.105 seconds
#> Error : tar_group() must take a grouped data frame from dplyr::group_by()
#> ✖ Problem with the pipeline.
#> ℹ Show errors: tar_meta(fields = error, complete_only = TRUE)
#> ℹ Learn more: https://books.ropensci.org/targets/debugging.html
#> Error:
#> ! problem with the pipeline.

Created on 2022-08-04 by the reprex package (v2.0.1)

Replacing by = by with by = g1 or by = 'g1' works just fine, so it seems that tar_map isn't substituting the value of by from my list. Are there any solutions or work-arounds that don't involve hand-coding all of my tar_group_* statements?

(I usually use tar_group_by when the targets are not in a static branch. I was hoping tar_group_select would let me pass my grouping variables in the context of tar_map, but that doesn't seem to work.)


Solution

  • tar_map() works on the command and pattern settings of regular target objects, but it doesn't see the by argument of special target factories like tar_group_select(). So instead, I recommend something like tar_eval(), which straightforwardly substitutes symbols in the expression you give it.

    # _targets.R file
    library(targets)
    library(tarchetypes)
    library(tidyverse)
    library(rlang)
    
    df1 <- data.frame(
      g1 = rep(LETTERS[1:3], 2),
      x = 1:6,
      y = 1:6
    ) |>
      group_by(g1)
    
    df2 <- data.frame(
      g1 = rep(LETTERS[1:3], 2),
      g2 = rep(LETTERS[4:6], 2),
      x = 1:6,
      y = 1:6
    ) |>
      group_by(g2)
    
    tar_eval(
      values = list(
        name = c("result1", "result2"),
        label = c("group1", "group2"),
        source = c("df1", "df2") |> syms(),
        by = c("g1", "g2")  # group by g1 for df1 and g2 for df2
      ),
      tar_group_select(
        name = name,
        command = identity(source),
        by = all_of(by)
      )
    )
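
To check what the substitution produces before running the pipeline, one option is tarchetypes::tar_sub(), which takes the same expr and values arguments as tar_eval() but returns the substituted expressions instead of evaluating them. The following is a minimal sketch, assuming tarchetypes and rlang are installed; the values list simply mirrors the one above (minus the unused label entry), and nothing here builds or runs a target.

```r
# Preview the calls that tar_eval() would evaluate, without evaluating them.
library(tarchetypes)

values <- list(
  name = c("result1", "result2"),
  source = rlang::syms(c("df1", "df2")),  # symbols, so they are inserted as object names
  by = c("g1", "g2")                      # strings, so they are inserted as literals
)

# tar_sub() substitutes the symbols name, source, and by once per element
# of values and returns a list with one expression per element.
exprs <- tar_sub(
  tar_group_select(
    name = name,
    command = identity(source),
    by = all_of(by)
  ),
  values = values
)

# Inspect each substituted call.
for (e in exprs) print(e)
```

If the printed calls look right (the second one should refer to df2 and "g2"), passing the same expression and values to tar_eval() in _targets.R should define the corresponding targets.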