
Line by line reading from HTTPS connection in R


When a connection is created with open="r", it allows line-by-line reading, which is useful for batch processing large data streams. For example, this script parses a sizable gzipped JSON HTTP stream by reading 100 lines at a time. Unfortunately, url() connections in R do not support SSL, so HTTPS URLs fail:

> readLines(url("https://api.github.com/repos/jeroenooms/opencpu"))
Error in readLines(url("https://api.github.com/repos/jeroenooms/opencpu")) : 
  cannot open the connection: unsupported URL scheme

The RCurl and httr packages do support HTTPS, but I don't think they can create a connection object like the one returned by url(). Is there another way to read an HTTPS stream line by line, as in the script above?
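
To make the batching behaviour concrete, here is a minimal sketch that uses an in-memory text connection in place of a network stream. The sample data and the variable names are made up for illustration. Because the connection is already open, each readLines() call continues where the previous one stopped.

```r
# Illustrative data only: 250 lines standing in for a large stream.
# textConnection() returns a connection that is already open for reading.
con <- textConnection(sprintf("line %d", 1:250))

batch_sizes <- integer(0)
# readLines() returns character(0) once the connection is exhausted,
# which ends the loop.
while (length(lines <- readLines(con, n = 100))) {
  batch_sizes <- c(batch_sizes, length(lines))
}
close(con)

print(batch_sizes)  # batches of 100, 100 and 50 lines
```

The same loop should work with any open connection that supports readLines(), which is why a url()-like connection object for HTTPS is what the question asks for.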


Solution

  • One solution is to call the curl executable manually via pipe(). The following seems to work:

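    library(jsonlite)

    # Stream the gzipped file through the curl executable and decompress it on the fly
    gzstream <- gzcon(pipe("curl https://jeroenooms.github.io/files/hourly_14.json.gz", open="r"))
    batches <- list(); i <- 1
    # Read 100 lines at a time until the stream is exhausted
    while(length(records <- readLines(gzstream, n = 100))){
      message("Batch ", i, ": found ", length(records), " lines of json...")
      # Wrap the newline-delimited records into a single JSON array
      json <- paste0("[", paste0(records, collapse=","), "]")
      batches[[i]] <- fromJSON(json, validate=TRUE)
      i <- i+1
    }
    # Combine the per-batch data frames and clean up
    weather <- rbind.pages(batches)
    rm(batches); close(gzstream)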
    

    However, this is suboptimal because the curl executable may not be available on every system. It would be much nicer to open this stream directly via RCurl/libcurl.
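
One possible way to avoid the external executable, assuming the curl package (an R binding to libcurl) is installed: its curl() function returns a connection object that can be wrapped in gzcon() and read with readLines(), much like one from url(). The sketch below is not part of the original answer. The helper name read_batches is hypothetical, and the HTTPS part is guarded so that it only runs in an interactive session.

```r
# Hypothetical helper: apply `handler` to successive batches of `n` lines
# read from a connection, and collect the results in a list.
read_batches <- function(con, n = 100, handler = identity) {
  out <- list()
  while (length(lines <- readLines(con, n = n))) {
    out[[length(out) + 1]] <- handler(lines)
  }
  out
}

# Offline check with an in-memory connection (no network needed).
demo <- textConnection(c("a", "b", "c"))
res <- read_batches(demo, n = 2, handler = length)
close(demo)

# Sketch of the HTTPS case. Assumes the curl package is installed and the
# URL from the answer above is still reachable; skipped in non-interactive runs.
if (interactive() && requireNamespace("curl", quietly = TRUE)) {
  con <- gzcon(curl::curl("https://jeroenooms.github.io/files/hourly_14.json.gz", open = "rb"))
  batches <- read_batches(con, n = 100)
  close(con)
}
```

Passing a JSON-parsing function as `handler` would reproduce the batching from the pipe() version without depending on a curl binary being installed.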