r, url, download, wget

Download specific files from a URL in R


I would like to download multiple files (around 2000) from this URL: https://www.star.nesdis.noaa.gov/pub/corp/scsb/wguo/data/Blended_VH_4km/geo_TIFF/

However, to save time and disk space, I would like to download only the files whose names contain VCI.tif, and only for the years 1981 to 2011.

I tried wget in bash but could not find a way to select only the files I want. Additionally, the full download is huge (more than 140 GB).

Thank you!


Solution

  • The following uses wget. I have only tested it on a (very) small subset of the wanted files, and it works for at least the first 2.

    suppressPackageStartupMessages({
      library(rvest)    # read_html(), html_elements(), html_attr()
      library(dplyr)    # mutate(), filter(), pull()
      library(stringr)  # str_extract()
    })
    
    # big files need greater timeout values when R does the download;
    # wget runs as an external process and does not use R's
    # timeout option, so this is probably unnecessary
    old_timeout <- options(timeout = 300)
    getOption("timeout")
    
    year_start <- 1981
    year_end <- 2011
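    download_dir <- "~/Temp/"
    # -P tells wget which directory to save into;
    # the empty third slot is filled with each file's URL below
    wget_cmd_line <- c("-P", download_dir, "")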
    
    link <- "https://www.star.nesdis.noaa.gov/pub/corp/scsb/wguo/data/Blended_VH_4km/geo_TIFF/"
    page <- read_html(link)
    
    files_urls <- page %>%
      html_elements("a") %>%
      html_attr("href")
    
    wanted_urls <- files_urls %>%
      # keep only links ending in .VCI.tif; everything else becomes NA
      str_extract(pattern = "^.*\\.VCI\\.tif$") %>%
      na.omit() %>%
      data.frame(filename = .) %>% 
      # the 7-digit stamp in the file name starts with the 4-digit year
      mutate(year = str_extract(filename, "\\d{7}"),
             year = str_extract(year, "^\\d{4}"),
             year = as.integer(year)) %>%
      filter(year >= year_start & year <= year_end)
    
    wanted_urls %>%
      #
      # to test the code I only download 2 files;
      # comment out this instruction to download all of them
      head(n = 2) %>%
      #
      pull(filename) %>%
      lapply(\(x) {
        wget_cmd <- wget_cmd_line
        wget_cmd[3] <- paste0(link, x)
        system2("wget", args = wget_cmd, stdout = TRUE, stderr = TRUE)
      })
    
    # put saved value back
    options(old_timeout)
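
If wget is not available, the same selection and download steps can probably be done in base R alone. The following is only a sketch under some assumptions: `select_vci_files()` and `download_missing()` are hypothetical helper names, not part of any package, and the sample file names are made up in the style the regular expressions above expect (a 7-digit date stamp and a `.VCI.tif` suffix). The `download_fun` argument exists only so that the loop can be dry-run without touching the network.

```r
# Hypothetical helper: keep only *.VCI.tif names whose 7-digit stamp
# starts with a year inside [year_start, year_end].
select_vci_files <- function(files, year_start = 1981, year_end = 2011) {
  is_vci <- grepl("\\.VCI\\.tif$", files)
  # position of the first run of 7 digits in each name (-1 if none)
  m <- regexpr("[0-9]{7}", files)
  year <- rep(NA_integer_, length(files))
  hit <- m > 0
  # the first 4 digits of the stamp are the year
  year[hit] <- as.integer(substr(files[hit], m[hit], m[hit] + 3L))
  keep <- is_vci & !is.na(year) & year >= year_start & year <= year_end
  files[keep]
}

# Hypothetical helper: download each file unless it already exists locally.
# mode = "wb" keeps binary files such as GeoTIFFs intact on Windows.
download_missing <- function(files, base_url, dest_dir,
                             download_fun = utils::download.file) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(dest_dir, files)
  todo <- !file.exists(dest)
  for (i in which(todo)) {
    download_fun(paste0(base_url, files[i]), dest[i], mode = "wb")
  }
  # return (invisibly) the paths that were downloaded in this call
  invisible(dest[todo])
}

# Offline usage with made-up names (assumed naming pattern):
sample_files <- c(
  "VHP.G04.C07.NC.P1981035.VH.VCI.tif",
  "VHP.G04.C07.NC.P1981035.VH.TCI.tif",
  "VHP.G04.C07.NJ.P2012001.VH.VCI.tif"
)
wanted <- select_vci_files(sample_files)
print(wanted)

# To download for real, pass the hrefs scraped from the page
# (files_urls in the answer above), e.g.:
# download_missing(select_vci_files(files_urls), link, path.expand("~/Temp"))
```

Skipping files that already exist means an interrupted run can be restarted without fetching everything again. Note that a partially written file from a failed download would also be skipped, so it may be worth deleting the last file before restarting.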