
Using R Targets to update a BIG dataset


I have a question about the best way to set up R targets to track files and update a big dataset.

I've read other posts, including this one, but none answer this question.

To illustrate what I need to accomplish, consider the following Reprex:

Different family members are traveling to different cities. Build a tibble to store this information:

city_log <- tibble::tibble(
  city = c("new_york", "sf", "tokyo"),
  traveler = list(c("Bob", "Mary", "Johnny", "Jenny"), 
                  c("Bob", "Mary", "Jenny"), 
                  c("Johnny", "Jenny"))
)

The goal would be to take this city-based information and convert it to person-based information.

library(dplyr) # provides %>%

# Kept as a separate object because traveler_log_full is reused later
traveler_log_full <- 
  city_log %>% 
  tidyr::unnest("traveler") 

traveler_log <- 
  traveler_log_full %>% 
  dplyr::nest_by(traveler, .key = "cities") %>% 
  dplyr::ungroup() %>% 
  # Summarize the number of cities visited per person
  dplyr::mutate(num_cities = purrr::map_dbl(cities, ~ nrow(.x)))

The challenge: an updated dataset
This dataset will be updated often. I want to reuse the computation already done for traveler_log_full, add only the new data to it, and then rebuild the final traveler_log with the summary stats.

city_log_updated <- tibble::tibble(
  city = c("new_york", "sf", "tokyo", "paris"),
  traveler = list(c("Bob", "Mary", "Johnny", "Jenny"), 
                  c("Bob", "Mary", "Jenny"), 
                  c("Johnny", "Jenny"), 
                  c("Bob", "Mary"))
)

I could do something like filtering out the old cities to keep only the new ones:

old_cities <- unique(traveler_log_full$city)

city_log_updated %>% 
  dplyr::filter(!city %in% old_cities)

Given that I have 7.7M cities and 20,000 travelers, I do not want to recalculate traveler_log_full each time I get a new city_log_updated.

How can I set up R targets to carry out this task?

  • I have read all the documentation on targets/targetopia.
  • I do not want to use dynamic branching, because if the dynamic branches change, I will have to regenerate all of the intermediate targets.
  • I considered static branching via tar_map(), but there are no values that I would use for iteration.
  • I think the ideal would be to manually take the big file (7.7M cities) and break it into 10 small files (manually assign idx?), and map along those.
  • Then, when an updated dataset arrives, try to create a new file with just the new cities.
  • An added challenge is that city_log_updated is technically called city_log, the same as the first. So if it gets replaced by a new file, targets will trigger regeneration of all of the intermediate objects too.

Thanks in advance for your help!


Solution

  • A targets pipeline is a directed acyclic graph of immutable dependencies. In other words, once a target completes, it cannot be overwritten by a downstream step in the pipeline. This restriction is essential for reproducibility. Everything that happens to a target needs to happen inside that target's own command. Otherwise, there would be no reliable way to detect all the changes necessary to decide whether to rerun or skip that target.

    I might be missing something, but it sounds like the challenge you propose is to update city_log based on the results of computing traveler_log and/or traveler_log_full. Unfortunately, this approach is not compatible with the conceptual model of targets because the graph city_log --> traveler_log_full --> traveler_log --> city_log is a cycle.

    If city_log_updated can be a different target than city_log, then you can express the project as a targets pipeline as follows:

    # _targets.R file
    library(targets)
    tar_source()
    tar_option_set(
      packages = "tidyverse",
      format = "feather" # efficient compressed storage for data frames
    )
    
    list(
      tar_target(
        name = city_log,
        command = tibble::tibble(
          city = c("new_york", "sf", "tokyo"),
          traveler = list(
            c("Bob", "Mary", "Johnny", "Jenny"), 
            c("Bob", "Mary", "Jenny"), 
            c("Johnny", "Jenny")
          )
        )
      ),
      tar_target(
        name = traveler_log_full,
        command = tidyr::unnest(city_log, "traveler")
      ),
      tar_target(
        name = traveler_log,
        command = traveler_log_full %>% 
          dplyr::nest_by(traveler, .key = "cities") %>% 
          dplyr::ungroup() %>% 
          dplyr::mutate(num_cities = map_dbl(cities, ~ nrow(.x)))
      ),
      tar_target(
        name = city_log_updated,
        command = your_function(traveler_log) # I am not sure what you had in mind here.
      )
    )
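
For what it's worth, the incremental step described in the question (unnest only the cities not seen before, then append them) can be sketched as a plain R function. This is a minimal sketch under assumptions, not a targets feature: `append_new_cities` is a hypothetical helper name, and it assumes the traveler list of an already-recorded city never changes. It only fits the pipeline if the updated log is kept as a separate target, as noted above.

```r
library(dplyr)

# Hypothetical helper (not part of targets): unnest only the cities that are
# not yet present in traveler_log_full, then append them to the existing rows.
# Assumption: rows for already-recorded cities never change.
append_new_cities <- function(traveler_log_full, city_log_updated) {
  old_cities <- unique(traveler_log_full$city)

  new_cities <- city_log_updated %>%
    dplyr::filter(!city %in% old_cities)

  # Nothing new: return the existing log untouched
  if (nrow(new_cities) == 0) {
    return(traveler_log_full)
  }

  new_rows <- tidyr::unnest(new_cities, "traveler")
  dplyr::bind_rows(traveler_log_full, new_rows)
}
```

The result can then be passed through the same nest_by() / mutate() step shown earlier to rebuild traveler_log with the summary column.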