
Using R to access FTP Server and Download Files Results in Status "530 Not logged in"


What I'm Attempting to Do

I'm attempting to download several weather data files from the US National Climatic Data Center's FTP server, but I run into an error after several files have downloaded successfully.

After successfully downloading two station/year combinations I start getting a "530 Not logged in" error. I've tried restarting at the offending year and running from there, with roughly the same result: it downloads a year or two of data and then stops with the same not-logged-in error.

Working Example

Following is a working example (or not quite, as you'll see), with the output truncated and pasted below the code.

options(timeout = 300)
ftp <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/"
td <- tempdir()
station <- c("983240-99999", "983250-99999", "983270-99999", "983280-99999", "984260-41231", "984290-99999", "984300-99999", "984320-99999", "984330-99999")
years <- 1960:2016

for (i in years) {
  remote_file_list <- RCurl::getURL(
    paste0(ftp, "/", i, "/"), ftp.use.epsv = FALSE, ftplistonly = TRUE,
    crlf = TRUE, ssl.verifypeer = FALSE)
  remote_file_list <- strsplit(remote_file_list, "\r*\n")[[1]]

  file_list <- paste0(station, "-", i, ".op.gz")

  file_list <- file_list[file_list %in% remote_file_list]

  file_list <- paste0(ftp, i, "/", file_list)

  Map(function(ftp, dest) utils::download.file(url = ftp,
                                               destfile = dest, mode = "wb"),
      file_list, file.path(td, basename(file_list)))
}


trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1960/983250-99999-1960.op.gz'
Content type 'unknown' length 7135 bytes
==================================================
downloaded 7135 bytes

...

trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1961/984290-99999-1961.op.gz'
Content type 'unknown' length 7649 bytes
==================================================
downloaded 7649 bytes

trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1962/983250-99999-1962.op.gz'
downloaded 0 bytes

Error in utils::download.file(url = ftp, destfile = dest, mode = "wb") :
  cannot download all files
In addition: Warning message:
In utils::download.file(url = ftp, destfile = dest, mode = "wb") :
  URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1962/983250-99999-1962.op.gz':
  status was '530 Not logged in'

Methods and Ideas I've Tried Without Success

So far I've tried slowing the requests down with Sys.sleep in a for loop (a rough sketch follows this list), and other ways of retrieving the files more slowly, such as opening and then closing connections between requests. It's puzzling because:

i) it works for a while and then stops, and the failure isn't tied to a particular year/station combination per se;
ii) I can use nearly the same code to download the much larger annual files of global weather data over a similarly long run of years without any errors; and
iii) it doesn't always fail at the same point: sometimes it stops after 1961 when moving to 1962, sometimes after 1960 when moving to 1961, and so on. From what I've seen, though, it consistently fails between years rather than within a year.
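
Roughly what the throttling attempt looked like (the five-second pause is illustrative, not the exact value I used):

# Illustrative only: inside the year loop above, replacing the Map() call,
# with a pause between downloads (the 5-second delay is an example value)
for (f in file_list) {
  utils::download.file(url = f, destfile = file.path(td, basename(f)),
                       mode = "wb")
  Sys.sleep(5)  # wait before requesting the next file
}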

The login is anonymous, but you can supply credentials with userpwd, e.g. "ftp:your@email.address". So far I've been unsuccessful in using that to make sure I stay logged in while downloading the station files.
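
This is roughly the kind of call I mean; it mirrors the directory-listing request from the example above, and the email address is just a placeholder:

# Passing explicit anonymous credentials via the userpwd curl option
# (the email address is a placeholder)
remote_file_list <- RCurl::getURL(
  paste0(ftp, "1960/"), userpwd = "ftp:your@email.address",
  ftp.use.epsv = FALSE, ftplistonly = TRUE, crlf = TRUE,
  ssl.verifypeer = FALSE)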


Solution

  • I think you're going to need a more defensive strategy when working with this FTP server:

    library(curl)  # ++gd > RCurl
    library(purrr) # consistent "data first" functional & piping idioms FTW
    library(dplyr) # progress bar
    
    # We'll use this to fill in the years
    ftp_base <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/%s/"
    
    dir_list_handle <- new_handle(ftp_use_epsv=FALSE, dirlistonly=TRUE, crlf=TRUE,
                                  ssl_verifypeer=FALSE, ftp_response_timeout=30)
    
    # Since you, yourself, noted the server was perhaps behaving strangely or under load
    # it's prbly a much better idea (and a practice of good netizenship) to cache the
    # results somewhere predictable rather than a temporary, ephemeral directory
    cache_dir <- "./gsod_cache"
    dir.create(cache_dir, showWarnings=FALSE)
    
    # Given the sporadic efficacy of server connection, we'll wrap our calls
    # in safe & retry functions. Change this variable if you want to have it retry
    # more times.
    MAX_RETRIES <- 6
    
    # Wrapping the memory fetcher (for dir listings)
    s_curl_fetch_memory <- safely(curl_fetch_memory)
    retry_cfm <- function(url, handle) {
    
      i <- 0
      repeat {
        i <- i + 1
        res <- s_curl_fetch_memory(url, handle=handle)
        if (!is.null(res$result)) return(res$result)
        if (i==MAX_RETRIES) { stop("Too many retries...server may be under load") }
      }
    
    }
    
    # Wrapping the disk writer (for the actual files)
    # Note the use of the cache dir. It won't waste your bandwidth or the
    # server's bandwidth or CPU if the file has already been retrieved.
    s_curl_fetch_disk <- safely(curl_fetch_disk)
    retry_cfd <- function(url, path) {
    
      # you should prbly be a bit more thorough than `basename` since
      # i think there are issues with the 1971 and 1972 filenames. 
      # Gotta leave some work up to the OP
      cache_file <- sprintf("%s/%s", cache_dir, basename(url))
      if (file.exists(cache_file)) return()
    
      i <- 0
      repeat {
        i <- i + 1
        res <- s_curl_fetch_disk(url, cache_file)
        if (!is.null(res$result)) return()
        if (i==MAX_RETRIES) { stop("Too many retries...server may be under load") }
      }
    
    }
    
    # the stations and years
    station <- c("983240-99999", "983250-99999", "983270-99999", "983280-99999",
                 "984260-41231", "984290-99999", "984300-99999", "984320-99999",
                 "984330-99999")
    years <- 1960:2016
    
    # progress indicators are like bowties: cool
    pb <- progress_estimated(length(years))
    walk(years, function(yr) {
    
      # the year we're working on
      year_url <- sprintf(ftp_base, yr)
    
      # fetch the directory listing
      tmp <- retry_cfm(year_url, handle=dir_list_handle)
      con <- rawConnection(tmp$content)
      fils <- readLines(con)
      close(con)
    
      # sift out only the target stations
      map(station, ~grep(., fils, value=TRUE)) %>%
        keep(~length(.)>0) %>%
        flatten_chr() -> fils
    
      # grab the stations files
      walk(paste(year_url, fils, sep=""), retry_cfd)
    
      # tick off progress
      pb$tick()$print()
    
    })
    

    You may also want to set curl_interrupt to TRUE in the curl handle if you want to be able to stop/esc/interrupt the downloads.