
R loop completes only 3 iterations out of 2504


I've written a function to download multiple files from NOAA's database. First, I have sites, a list of the site IDs I want to download from the website. It looks like this:

    > head(sites)
    [[1]]
    [1] "9212"

    [[2]]
    [1] "10158"

    [[3]]
    [1] "11098"

    > length(sites)
    [1] 2504

My function is shown below.

    library(httr)

    tested <- lapply(seq_along(sites), function(x) {
      no <- sites[[x]]
      # request an archive for this site ID
      data <- GET(paste0('https://www.ncdc.noaa.gov/paleo-search/data/search.json?xmlId=', no))
      v <- content(data)
      # look up the archive's status and pull out the download URL
      check <- GET(v$statusUrl)
      j <- content(check)
      URL <- j$archive
      download.file(URL, destfile = paste0('./tree_ring/', no, '.zip'))
    })

The weird issue is that it works for the first three sites (they download properly), but then it stops and throws the following error:

Error in charToRaw(URL) : argument must be a character vector of length 1 

I've tried manually downloading the 4th and 5th sites (using the same code as above, but run line by line outside the function, as sketched below) and it works fine. What could be going on here?
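For reference, the manual version for the 4th site looks roughly like this:

    no <- sites[[4]]  # "15757"
    data <- GET(paste0('https://www.ncdc.noaa.gov/paleo-search/data/search.json?xmlId=', no))
    v <- content(data)
    check <- GET(v$statusUrl)
    j <- content(check)
    download.file(j$archive, destfile = paste0('./tree_ring/', no, '.zip'))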

EDIT 1: Showing more site IDs as requested

    > dput(sites[1:6])
    list("9212", "10158", "11098", "15757", "15777", "15781")

Solution

  • I converted your code to a for loop so I could see the most recent values of all your variables when things fail.

    The failures aren't consistently on the 4th site: running your code a few times, it sometimes fails on site 2, 3, or 4. When it fails, if I look at j, I see this:

    $message
    [1] "finalizing archive"

    $status
    [1] "working"


    If I re-run check <- GET(v$statusUrl); j <- content(check) a few seconds later, then I see

    $archive
    [1] "https://www.ncdc.noaa.gov/web-content/paleo/bundle/1986420067_2020-04-23.zip"
    
    $status
    [1] "complete"
    

    So, I think it takes the server a little time to prepare the file for download, and sometimes R asks for the file before it's ready, which causes the error. A simple fix might look like this:

    check_status <- function(v) {
      check <- GET(v$statusUrl)
      content(check)
    }

    for (x in seq_along(sites)) {
      no <- sites[[x]]
      data <- GET(paste0('https://www.ncdc.noaa.gov/paleo-search/data/search.json?xmlId=', no))
      v <- content(data)
      # poll until the archive is ready, up to 100 tries (~10 seconds)
      try_counter <- 0
      j <- check_status(v)
      while (j$status != "complete" && try_counter < 100) {
        Sys.sleep(0.1)
        j <- check_status(v)
        try_counter <- try_counter + 1
      }
      URL <- j$archive
      download.file(URL, destfile = paste0('./tree_ring/', no, '.zip'))
    }
    

    If the status isn't "complete" yet, this version waits 0.1 seconds before checking again, giving up after 100 tries (about 10 seconds).
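
    If a site never becomes ready within those 100 tries, URL will still be NULL and download.file() will fail the same way, so you may also want to skip such sites instead of stopping the whole run. A sketch of that idea (the wait_for_archive helper and its defaults are my own, not part of the NOAA API):

    library(httr)

    # Poll the status URL until the archive is ready; return NULL on timeout.
    wait_for_archive <- function(v, max_tries = 100, wait = 0.1) {
      for (i in seq_len(max_tries)) {
        j <- content(GET(v$statusUrl))
        if (identical(j$status, "complete")) return(j$archive)
        Sys.sleep(wait)
      }
      NULL
    }

    for (no in sites) {
      v <- content(GET(paste0('https://www.ncdc.noaa.gov/paleo-search/data/search.json?xmlId=', no)))
      URL <- wait_for_archive(v)
      if (is.null(URL)) {
        warning("skipping site ", no, ": archive never became ready")
        next
      }
      download.file(URL, destfile = paste0('./tree_ring/', no, '.zip'))
    }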