I would like to download multiple files (around 2000) from this URL: https://www.star.nesdis.noaa.gov/pub/corp/scsb/wguo/data/Blended_VH_4km/geo_TIFF/
However, to limit time and disk space, I would like to download only the files whose names contain VCI.tif, and only for the years 1981 - 2011.
I used wget in bash but could not find a way to select only the files I want. Additionally, the space consumed is huge (more than 140 GB).
Thank you!
The following uses wget, and it works at least with the first 2 files: I have only tested the download of a (very) small subset of the wanted files.
suppressPackageStartupMessages({
  library(rvest)    # read_html(), html_elements(), html_attr()
  library(dplyr)    # data frame pipeline verbs
  library(stringr)  # str_extract()
})
# big files need a larger timeout value; since the actual
# downloads are done by wget rather than by R, this is
# probably unnecessary
old_timeout <- options(timeout = 300)
getOption("timeout")
year_start <- 1981
year_end <- 2011
download_dir <- "~/Temp/"
# wget argument template; the empty third element is filled
# in with each file's URL below
wget_cmd_line <- c("-P", download_dir, "")
link <- "https://www.star.nesdis.noaa.gov/pub/corp/scsb/wguo/data/Blended_VH_4km/geo_TIFF/"
page <- read_html(link)
files_urls <- page %>%
  html_elements("a") %>%
  html_attr("href")
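# files_urls now holds every href on the index page; the next
# pipeline keeps only the wanted .VCI.tif entries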
wanted_urls <- files_urls %>%
  # keep only the links whose names end in .VCI.tif
  str_extract(pattern = "^.*\\.VCI\\.tif$") %>%
  na.omit() %>%
  data.frame(filename = .) %>%
  # the first 4 of the 7 consecutive digits in each
  # file name are the year
  mutate(year = str_extract(filename, "\\d{7}"),
         year = str_extract(year, "^\\d{4}"),
         year = as.integer(year)) %>%
  filter(year >= year_start & year <= year_end)
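# NOTE: this assumes every wanted file name embeds a 7-digit
# field whose first 4 digits are the year; if the naming
# scheme differs, adjust the regular expressions above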
wanted_urls %>%
  #
  # to test the code I download only 2 files;
  # comment out this instruction to download all of them
  head(n = 2) %>%
  #
  pull(filename) %>%
  lapply(\(x) {   # \(x) is the lambda shorthand of R >= 4.1
    wget_cmd <- wget_cmd_line
    wget_cmd[3] <- paste0(link, x)
    system2("wget", args = wget_cmd, stdout = TRUE, stderr = TRUE)
  })
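# the pipeline's value is a list holding wget's captured output
# for each file; assign it (res <- wanted_urls %>% ...) if you
# want to inspect the downloads afterwards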
# restore the saved timeout value
options(old_timeout)
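If you would rather avoid the external wget dependency altogether, base R's download.file() can take the place of the system2() call. A minimal sketch, assuming the same link, download_dir and wanted_urls objects built above (with download.file() the larger options(timeout = 300) set earlier actually matters):
# download each wanted file with base R instead of wget;
# drop head() to fetch all of them
for (f in head(wanted_urls$filename, n = 2)) {
  download.file(url = paste0(link, f),
                destfile = paste0(download_dir, f),
                mode = "wb")  # "wb" keeps the TIFFs intact on Windows
}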