
R {drake} plan: Read many datasets into single target


I have started using {drake} for a data production pipeline. The raw data I work with is quite large and is split into ~130 separate Stata files, so each file should be processed separately. To keep the plan readable, I use target() with transform = map() to specify it. It looks similar to the code below:

plan <- drake_plan(
    dta_paths = list.files(my_folder, full.names = TRUE),
    dfs = target(
        read.dta13(dta_path),
        transform = map(dta_path = dta_paths)
    )
)

When I make() the plan, I get the following warnings and error:

target dfs_dta_paths

Warning: target dfs_dta_paths warnings:

the condition has length > 1 and only the first element will be used

the condition has length > 1 and only the first element will be used

the condition has length > 1 and only the first element will be used

fail dfs_dta_paths

Error: Target dfs_dta_paths failed. Call diagnose(dfs_dta_paths) for details. Error message:

Expecting a single string value: [type=character; extent=129].

As far as I understand the warnings and the error message, the mapping over the file paths is not working: the full vector is passed to a single read.dta13() call. I read https://books.ropensci.org/drake/static.html#map, but it did not help me figure out the problem. Converting the vector of paths to a list did not help either.

From "How to combine multiple drake targets into a single cross target without combining the datasets?" I got the idea of predefining a grid, which works as suggested. But since I only need a vector, not a complex grid, this looks like over-engineering to me.
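For reference, the grid workaround can be sketched roughly as follows. This is a minimal sketch, not the exact code from that question: the two paths and the data frame name `grid` are hypothetical placeholders, and it assumes map() accepts a predefined grid through its `.data` argument, as described in the static branching chapter of the drake manual.

```r
library(drake)

# Hypothetical stand-in for list.files(my_folder, full.names = TRUE).
dta_paths <- c("folder/file1.dta", "folder/file2.dta")

# One-column grid: each row is meant to become one target.
grid <- data.frame(dta_path = dta_paths, stringsAsFactors = FALSE)

plan <- drake_plan(
  dfs = target(
    # The command is only stored here, not run, so no files are read yet.
    readstata13::read.dta13(dta_path),
    # !! inserts the grid itself instead of the symbol `grid`.
    transform = map(.data = !!grid)
  )
)

plan
```

Printing the plan before calling make() shows whether each path ended up in its own target.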

I feel like I'm missing something obvious, but I can't spot it. Any ideas what's wrong with my code?


I am aware of https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets, but since I want to iterate in the process of data cleaning, I thought it would be helpful to create the dfs target as shown above.


Solution

  • When you use target(transform = ...), it is always best to visualize the plan before you feed it to make(). It can take a couple of iterations to get it right. Here is what your current plan looks like.

    library(drake)
    plan <- drake_plan(
      dta_paths = list.files(my_folder, full.names = TRUE),
      dfs = target(
        read.dta13(dta_path),
        transform = map(dta_path = dta_paths)
      )
    )
    
    plan
    #> # A tibble: 2 x 2
    #>   target        command                                 
    #>   <chr>         <expr>                                  
    #> 1 dta_paths     list.files(my_folder, full.names = TRUE)
    #> 2 dfs_dta_paths read.dta13(dta_paths)
    
    config <- drake_config(plan)
    vis_drake_graph(config)
    

    Created on 2020-01-16 by the reprex package (v0.3.0)

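    The plan contains a single target, dfs_dta_paths, because map() treats dta_paths as a symbol rather than as the vector of paths it stands for. To read one file per target, I recommend the plan below. See https://books.ropensci.org/drake/static.html#tidy-evaluation for more on why it uses !!.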

    library(drake)
    
    # create some faux stata files for the example.
    my_folder <- fs::dir_create("folder")
    file.create("folder/file1.dta")
    #> [1] TRUE
    file.create("folder/file2.dta")
    #> [1] TRUE
    
    # Since you are using static branching (https://books.ropensci.org/drake/static.html)
    # this needs to be defined up front.
    # It does not need to be a target, re https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets
    dta_paths <- list.files(my_folder, full.names = TRUE)
    
    plan <- drake_plan(
      dfs = target(
        # Use !! here to literally insert the path so file_in() can mark it for tracking.
        read.dta13(file_in(!!dta_path)),
        # Use !! here to insert the actual vector of paths instead of the symbol `dta_paths`
        transform = map(dta_path = !!dta_paths)
      )
    )
    
    plan
    #> # A tibble: 2 x 2
    #>   target               command                                
    #>   <chr>                <expr>                                 
    #> 1 dfs_folder.file1.dta read.dta13(file_in("folder/file1.dta"))
    #> 2 dfs_folder.file2.dta read.dta13(file_in("folder/file2.dta"))
    
    config <- drake_config(plan)
    vis_drake_graph(config)
    

    Created on 2020-01-16 by the reprex package (v0.3.0)
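
To illustrate why the unquoting matters, here is a small base R sketch. It does not use drake's own machinery: bquote() with .() serves as a stand-in for !! (an assumption made purely for illustration), and map() is never called, only captured as an unevaluated expression. The two paths are placeholders.

```r
# Stand-in for list.files(my_folder, full.names = TRUE).
dta_paths <- c("folder/file1.dta", "folder/file2.dta")

# Without unquoting, the captured expression holds the symbol `dta_paths`.
as_symbol <- quote(map(dta_path = dta_paths))

# With unquoting, the character vector itself is spliced into the expression.
as_value <- bquote(map(dta_path = .(dta_paths)))

# Inspect what each captured call actually carries.
is.symbol(as_symbol[["dta_path"]])
length(as_value[["dta_path"]])
```

In the first case there is only one thing to map over, the name `dta_paths`, which matches the single dfs_dta_paths target in the original plan. In the second case there is one element per file.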