Tags: r, parallel-processing, download, furrr

How can I configure future to download more files?


I have a lot of files I need to download.

I am using the download.file() function and furrr::map to download the files in parallel, with plan(strategy = "multicore").

Please advise: how can I load more jobs onto each future?

Running on Ubuntu 18.04 with 8 cores. R version 3.5.3.

The files can be txt, zip, or any other format. Sizes range from 5 MB to 40 MB each.


Solution

  • Using furrr works just fine; I think what you mean is furrr::future_map. Using multicore substantially increases download speed. (Note: on Windows, multicore is not available, only multisession. Use multiprocess if you are unsure what platform your code will be run on.)

    library(furrr)
    #> Loading required package: future
    
    csv_file <- "https://raw.githubusercontent.com/UofTCoders/rcourse/master/data/iris.csv"
    
    download_template <- function(.x) {
        temp_file <- tempfile(pattern = paste0("dl-", .x, "-"), fileext = ".csv")
        download.file(url = csv_file, destfile = temp_file)
    }
    
    download_normal <- function() {
        for (i in 1:5) {
            download_template(i)
        }
    }
    
    download_future_core <- function() {
        plan(multicore)
        future_map(1:5, download_template)
    }
    
    download_future_session <- function() {
        plan(multisession)
        future_map(1:5, download_template)
    }
    
    library(microbenchmark)
    
    microbenchmark(
        download_normal(),
        download_future_core(),
        download_future_session(),
        times = 3
    )
    #> Unit: milliseconds
    #>                       expr       min        lq      mean    median
    #>          download_normal()  931.2587  935.0187  937.2114  938.7787
    #>     download_future_core()  433.0860  435.1674  488.5806  437.2489
    #>  download_future_session() 1894.1569 1903.4256 1919.1105 1912.6942
    #>         uq       max neval
    #>   940.1877  941.5968     3
    #>   516.3279  595.4069     3
    #>  1931.5873 1950.4803     3
    

    Created on 2019-03-25 by the reprex package (v0.2.1)

    Keep in mind that I am running this on Ubuntu, so results on Windows will likely differ, since, as far as I understand, future does not allow multicore on Windows.
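
    If the same script has to run on both Linux and Windows, one option (a minimal sketch of my own, not part of the benchmark above) is to choose the plan based on the platform, since multicore only works on Unix-alikes:

    library(future)
    
    # multicore (forked processes) is not supported on Windows,
    # so fall back to multisession (background R sessions) there
    if (.Platform$OS.type == "windows") {
        plan(multisession)
    } else {
        plan(multicore)
    }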

    I am just guessing here, but the reason multisession is slower could be that it has to open several R sessions before running download.file. I was only downloading a very small dataset (iris.csv), so on larger datasets that take longer to download, the time spent opening the R sessions would likely be offset by the download time itself.
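
    One way to test that guess (a sketch of my own, reusing download_template and the packages loaded above) is to set the plan once, outside the timed function, so the worker sessions are started a single time and reused across calls:

    plan(multisession)  # start the background R sessions once, up front
    
    download_future_session_reused <- function() {
        # the workers already exist, so only the downloads themselves are timed
        future_map(1:5, download_template)
    }
    
    microbenchmark(download_future_session_reused(), times = 3)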

    Minor update:

    You can pass a vector of dataset URLs into future_map so that each file is downloaded in parallel, scheduled by the future backend:

    data_urls <- c("https:.../data.csv", "https:.../data2.csv")
    library(furrr)
    plan(multiprocess)
    future_map(data_urls, download.file)
    # Or use walk 
    # future_walk(data_urls, download.file)
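
    If you also want to control where each file is saved, a small variant (the destination paths here are just an example) pairs every URL with its own destfile via future_map2():

    library(furrr)
    plan(multiprocess)
    
    data_urls  <- c("https:.../data.csv", "https:.../data2.csv")
    dest_files <- file.path(tempdir(), basename(data_urls))
    
    # iterate over URL/destination pairs in parallel, one pair per download
    future_map2(data_urls, dest_files, ~ download.file(.x, destfile = .y))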