Dynamic branching: define the order of targets in a single plan


Reading the documentation of the drake package, I found no other way to define the order of the targets without the use of 'file_in' and 'file_out'.

file_in() marks individual files (and whole directories) that your targets depend on.

file_out() marks individual files (and whole directories) that your targets create.

It is not possible, however, to use both with dynamic targets.

So how can I define the order in which dynamic targets run? I also tried make(plan, targets = c("ftp_list", "download.dbc", "dbc_list", "generate_parquet")), but it didn't work.

In the code below, for example, I have four targets. What I'd like (order):

  1. Get ftp list from the server
  2. Download the first file from the ftp list (there is too little disk space to download all of them at once)
  3. Get the downloaded file
  4. Convert it to .parquet (then start over: download the second file, convert it to .parquet, and so on)

Any idea how I can link dynamic targets without using file_in and file_out (not allowed in this case)? Thanks!

Example code:

URL <- "ftp://ftp.url"
LOCAL_PATH <- paste0(getwd())

plan <- drake_plan(

  ftp_list = obtain_filenames_from_url(url_ = URL, 
                                       remove_extension_from_filename_ = FALSE,
                                       full_names = TRUE)[1:10],  # R indexes from 1; index 0 is silently dropped

  download.dbc = target(download_dbc(ftp_list, 
                                local_path = paste0(LOCAL_PATH, "/")), 
                   dynamic = map(ftp_list)),

  dbc_list = target(list.files(LOCAL_PATH, full.names = TRUE, 
                               pattern = "\\.dbc$")),  # pattern is a regular expression, not a glob

  generate_parquet = target(convert_dbc(dbc_list, delete_dbc_after_conversion = TRUE),  
                            dynamic = map(dbc_list))
)

[Figure: plan graph output]


Solution

  • Target order

    file_in() and file_out() are only necessary when you actually need to work with files, directories, or URLs. drake targets are R objects, and target order is determined by how targets are mentioned in commands. drake reads your commands and functions with static code analysis to resolve target order. In the plan below, targets a, b, and c are in an arbitrary order, but drake runs them in the correct order because of how the symbols are mentioned.

    library(drake)
    
    plan <- drake_plan(
      c = head(b),
      a = mtcars[, seq_len(3)],
      b = tail(a)
    )
    
    plot(plan)
    

    
    make(plan)
    #> target a
    #> target b
    #> target c
    
    readd(c) # Targets are R objects
    #>                 mpg cyl  disp
    #> Porsche 914-2  26.0   4 120.3
    #> Lotus Europa   30.4   4  95.1
    #> Ford Pantera L 15.8   8 351.0
    #> Ferrari Dino   19.7   6 145.0
    #> Maserati Bora  15.0   8 301.0
    #> Volvo 142E     21.4   4 121.0
    

    Created on 2020-02-07 by the reprex package (v0.3.0)
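
    The same rule links dynamic targets: a dynamic target that mentions another dynamic target in its command (and maps over it) is built after it. Below is a minimal sketch, assuming toy stand-ins for the real steps. fake_download() and fake_convert() are hypothetical helpers that only build and modify small data frames, so nothing is actually downloaded. The sketch shows dependency linking only; it does not show that downloads and conversions are interleaved one file at a time.

```r
library(drake)

# Hypothetical stand-in for the download step: returns a one-row data frame.
fake_download <- function(name) {
  data.frame(file = name, n_chars = nchar(name), stringsAsFactors = FALSE)
}

# Hypothetical stand-in for the conversion step: flags the row as processed.
fake_convert <- function(df) {
  df$done <- TRUE
  df
}

plan <- drake_plan(
  file_names = c("a.dbc", "bb.dbc", "ccc.dbc"),
  # One sub-target per file name.
  downloaded = target(fake_download(file_names), dynamic = map(file_names)),
  # Mentions `downloaded`, so drake builds it after `downloaded`.
  converted = target(fake_convert(downloaded), dynamic = map(downloaded))
)

make(plan)

# Reading a dynamic target combines its sub-targets into one object.
result <- readd(converted)
```

    No file_in() or file_out() is involved: the mention of downloaded inside the command for converted is what creates the dependency.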

    Your plan

    Here are some things that could help your current plan.

    1. Use file_in() on ftp://ftp.url to detect when ftp_list should update.
    2. Define a function (say, get_dbc_data_frame()) to download some files (part of the ftp_list) and read them into memory.
    3. Skip converting to Parquet. Instead, return data frames as the sub-targets' values. With format = "fst" on the target, drake stores those data frames in fst files.

    Sketch:

    get_dbc_data_frame <- function(ftp_list_entry, local_path) {
      # 1. Download the files from the ftp_list_entry into local_path.
      # 2. Read them into memory.
      # 3. Return a data frame.
    }
    
    plan <- drake_plan(
      ftp_list = obtain_filenames_from_url(
        url_ = file_in("ftp://ftp.url"), 
        remove_extension_from_filename_ = FALSE,
        full_names = TRUE
      )[seq_len(10)], # First 10 entries; R indexes from 1.
      dbc_data = target(
        get_dbc_data_frame(ftp_list, local_path = paste0(getwd(), "/")),
        format = "fst", # Tell drake to store the data frame as an fst file.
        dynamic = map(ftp_list)
      )
    )
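
    To see how the data frames come back out of a plan shaped like the sketch above, here is a small toy version. make_toy_data_frame() is a hypothetical stand-in for get_dbc_data_frame(), and the example assumes the fst package is installed, because format = "fst" needs it.

```r
library(drake)

# Hypothetical stand-in for get_dbc_data_frame(): two rows per entry.
make_toy_data_frame <- function(entry) {
  data.frame(entry = entry, value = c(1, 2), stringsAsFactors = FALSE)
}

plan <- drake_plan(
  entries = c("x", "y", "z"),
  toy_data = target(
    make_toy_data_frame(entries),
    format = "fst", # Store each sub-target's data frame as an fst file.
    dynamic = map(entries)
  )
)

make(plan)

# Load one sub-target, or all of them combined into a single data frame.
second <- readd(toy_data, subtargets = 2)
all_rows <- readd(toy_data)
```

    Reading a single sub-target avoids loading every data frame at once, which matters when memory is as limited as disk space.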