I have a question about the best way to set up R targets to track files and update a big dataset.
I've read other posts, including this one, but none answer this question.
To illustrate what I need to accomplish, consider the following reprex:
# Different family members are traveling to different cities.
# Build a tibble to store this information.
city_log <- tibble::tibble(
city = c("new_york", "sf", "tokyo"),
traveler = list(c("Bob", "Mary", "Johnny", "Jenny"),
c("Bob", "Mary", "Jenny"),
c("Johnny", "Jenny"))
)
The goal would be to take this city-based information and convert it to person-based information.
traveler_log_full <- # separate object because traveler_log_full is re-used later
city_log %>%
tidyr::unnest("traveler")
traveler_log <-
traveler_log_full %>%
dplyr::nest_by(traveler, .key = "cities") %>%
dplyr::ungroup() %>%
  dplyr::mutate(num_cities = purrr::map_dbl(cities, nrow)) # summarize the number of cities visited per person
The challenge: an updated dataset
The challenge is that this dataset will be updated often. I want to reuse the computation from traveler_log_full when the data updates, and then remake the final traveler_log with the summary statistics.
city_log_updated <- tibble::tibble(
city = c("new_york", "sf", "tokyo", "paris"),
traveler = list(c("Bob", "Mary", "Johnny", "Jenny"),
c("Bob", "Mary", "Jenny"),
c("Johnny", "Jenny"),
c("Bob", "Mary"))
)
I could do something like filtering out the old cities to get only the new ones:
old_cities <- unique(traveler_log_full$city)
city_log_updated %>%
dplyr::filter(!city %in% old_cities)
Given that I have 7.7M cities and 20,000 travelers, I do not want to recalculate traveler_log_full from scratch each time I receive a new city_log_updated.
How can I set up R targets to carry out this task?
Thanks in advance for your help!
A targets pipeline is a directed acyclic graph of immutable dependencies. In other words, once a target completes, it cannot be overwritten by a downstream step in the pipeline. This restriction is essential for reproducibility. Everything that happens to a target needs to happen inside that target's own command. Otherwise, there would be no reliable way to detect all the changes necessary to decide whether to rerun or skip that target.
I might be missing something, but it sounds like the challenge you propose is to update city_log based on the results of computing traveler_log and/or traveler_log_full. Unfortunately, this approach is not compatible with the conceptual model of targets, because the graph city_log --> traveler_log_full --> traveler_log --> city_log is a cycle.
If city_log_updated can be a different target than city_log, then you can express the project as a targets pipeline as follows:
# _targets.R file
library(targets)
tar_source()
tar_option_set(
packages = "tidyverse",
format = "feather" # efficient compressed storage for data frames
)
list(
tar_target(
name = city_log,
command = tibble::tibble(
city = c("new_york", "sf", "tokyo"),
traveler = list(
c("Bob", "Mary", "Johnny", "Jenny"),
c("Bob", "Mary", "Jenny"),
c("Johnny", "Jenny")
)
),
  tar_target(
    name = traveler_log_full,
    command = tidyr::unnest(city_log, "traveler") # reference city_log so targets tracks the dependency
  ),
tar_target(
name = traveler_log,
command = traveler_log_full %>%
dplyr::nest_by(traveler, .key = "cities") %>%
dplyr::ungroup() %>%
      dplyr::mutate(num_cities = purrr::map_dbl(cities, nrow))
),
tar_target(
name = city_log_updated,
command = your_function(traveler_log) # I am not sure what you had in mind here.
)
)
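With that `_targets.R` in place, a typical session looks something like the sketch below (the target names match the pipeline above; the rest is the standard targets workflow):

```r
library(targets)

tar_make()        # build city_log, traveler_log_full, traveler_log, city_log_updated

# Read a completed target back from the data store.
traveler_log <- tar_read(traveler_log)

tar_visnetwork()  # visualize the dependency graph and see which targets are up to date
```

When the command of city_log changes, for example when a new city is added, tar_make() reruns traveler_log_full and its downstream targets while skipping everything that is unaffected.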