When a connection is created with open="r", it allows line-by-line reading, which is useful for batch processing large data streams. For example, this script parses a sizable gzipped JSON HTTP stream by reading 100 lines at a time. Unfortunately, url() in R does not support SSL, so HTTPS URLs fail:
> readLines(url("https://api.github.com/repos/jeroenooms/opencpu"))
Error in readLines(url("https://api.github.com/repos/jeroenooms/opencpu")) :
cannot open the connection: unsupported URL scheme
The RCurl and httr packages do support HTTPS, but I don't think they are capable of creating a connection object similar to url(). Is there some other way to do line-by-line reading of an HTTPS connection similar to the example in the script above?
One solution is to manually call the curl executable via pipe. The following seems to work.
library(jsonlite)
# Stream the gzipped file through the curl executable and decompress it on the fly
gzstream <- gzcon(pipe("curl https://jeroenooms.github.io/files/hourly_14.json.gz", open="r"))
batches <- list(); i <- 1
# Read 100 lines at a time until the stream is exhausted
while(length(records <- readLines(gzstream, n = 100))){
  message("Batch ", i, ": found ", length(records), " lines of json...")
  # Wrap the newline-delimited records in a JSON array so fromJSON can parse them
  json <- paste0("[", paste0(records, collapse=","), "]")
  batches[[i]] <- fromJSON(json, validate=TRUE)
  i <- i+1
}
weather <- rbind.pages(batches)
rm(batches); close(gzstream)
However, this is suboptimal because the curl executable might not be available for various reasons. It would be much nicer to invoke this pipe directly via RCurl/libcurl.
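For anyone who wants to try the batching loop without network access or the curl executable, here is a minimal offline sketch of the same readLines(n = 100) pattern. It assumes an in-memory textConnection as a stand-in for the decompressed HTTPS stream; the 250 sample records and the names lines, con and batch_sizes are made up for illustration and are not part of the script above.

```r
# Hypothetical data: 250 lines of newline-delimited JSON held in memory,
# standing in for the decompressed HTTP stream.
lines <- sprintf('{"id": %d}', 1:250)
con <- textConnection(lines, open = "r")

batch_sizes <- integer(0)
# readLines() returns character(0) once the input is exhausted,
# so length() becomes 0 and the loop ends.
while (length(records <- readLines(con, n = 100))) {
  batch_sizes <- c(batch_sizes, length(records))
}
close(con)

print(batch_sizes)
```

Any connection opened for reading should behave the same way in this loop, which is why a connection-like object for HTTPS is what the question is really after.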