Tags: r, netcdf, rcurl

Download files with a specific extension from a website


How can I download the content of a webpage, find all files with a specific extension listed on it, and then download all of them? For example, I would like to download all netCDF files (with the extension *.nc4) from the following webpage: https://data.giss.nasa.gov/impacts/agmipcf/agmerra/.

I was advised to look into the RCurl package but could not figure out how to do this with it.


Solution

    library(stringr)
    
    # Read the HTML source of the page
    thepage <- readLines('https://data.giss.nasa.gov/impacts/agmipcf/agmerra/')
    
    # Find the lines that contain links to the netCDF files
    # (the dot must be escaped; a bare '*.nc4' is a glob, not a regex)
    nc4.lines <- grep('\\.nc4', thepage)
    
    # Subset the page, keeping only those lines
    thepage <- thepage[nc4.lines]
    
    # Locate the file names: they start with 'A' and run up to '.nc4'
    # followed by the closing quote of the href attribute
    str.loc <- str_locate(thepage, 'A.*?\\.nc4"')
    
    # Extract the file names, dropping the trailing quote
    file.list <- substring(thepage, str.loc[, 1], str.loc[, 2] - 1)
    
    # Download all files; mode = "wb" keeps the binary netCDF files intact
    for (ifile in file.list) {
      download.file(paste0("https://data.giss.nasa.gov/impacts/agmipcf/agmerra/",
                           ifile),
                    destfile = ifile, mode = "wb", method = "libcurl")
    }
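
Since the question mentions RCurl, here is a minimal alternative sketch using that package to fetch the page as a single string and a regular expression to pull out the file names. The filename pattern (letters, digits, underscores, dots, and hyphens ending in .nc4) is an assumption about how the links are written on that page, not something taken from the original answer.

    library(RCurl)
    
    base.url <- 'https://data.giss.nasa.gov/impacts/agmipcf/agmerra/'
    
    # Fetch the whole page as one string
    page <- getURL(base.url)
    
    # Extract every distinct *.nc4 file name
    # (assumed filename pattern; adjust if the page uses other characters)
    file.list <- unique(regmatches(page,
                                   gregexpr('[A-Za-z0-9_.-]+\\.nc4', page))[[1]])
    
    # Download each file; mode = "wb" keeps the binary netCDF files intact
    for (ifile in file.list) {
      download.file(paste0(base.url, ifile), destfile = ifile, mode = "wb")
    }

One advantage of this approach is that getURL() returns the page regardless of how the lines are split, so the extraction does not depend on each link sitting on its own line.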