I am trying to import an XML file from a URL:
library(xml2)
x <- read_xml('https://ftp.ncbi.nlm.nih.gov/pub/GTR/data/gtr_ftp.xml.gz')
Error in read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html, :
Start tag expected, '<' not found [4]
According to the documentation, I should be able to pass a URL for a .gz file and have it uncompressed automatically. If I download the file, unzip it locally, and then use read_xml, it works fine. The file is fairly large (~2 GB unzipped), so I am not sure whether that is a problem over a connection. Any thoughts on how I can read this directly from a connection?
The catch is that the documentation says "Local paths ending in .gz, .bz2, .xz, .zip will be automatically uncompressed" (emphasis added). The logic seems to be in the xml2:::path_to_connection function. URLs are not automatically uncompressed, only local files on disk.
The read_xml function will use the curl package to work with URLs if it is installed. If you have that package, you can wrap the download connection in gzcon to do the decompression. Assuming you have enough RAM, you could try:
url <- 'https://ftp.ncbi.nlm.nih.gov/pub/GTR/data/gtr_ftp.xml.gz'
x <- read_xml(gzcon(curl::curl(url)))
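To see the difference between the two code paths without downloading 2 GB, here is a minimal local sketch. The temporary file and its tiny XML content are made up for illustration. It assumes that read_xml treats a path ending in .gz as a local compressed file, and that any other connection needs an explicit gzcon wrapper, which is the same mechanism as the curl call.

```r
library(xml2)

# Write a tiny gzip-compressed XML file to a temporary path (illustrative data only).
tmp <- tempfile(fileext = ".xml.gz")
out <- gzfile(tmp, "w")
writeLines("<root><item>a</item><item>b</item></root>", out)
close(out)

# Case 1: a local path ending in .gz is uncompressed automatically by read_xml.
doc_path <- read_xml(tmp)

# Case 2: a raw binary connection is not, so wrap it in gzcon() to inflate
# the bytes as they are read. A curl::curl(url) connection is handled the same way.
con <- gzcon(file(tmp, "rb"))
doc_con <- read_xml(con)
close(con)

# Both documents should contain the same two <item> nodes.
items_path <- xml_text(xml_find_all(doc_path, "//item"))
items_con <- xml_text(xml_find_all(doc_con, "//item"))
print(items_path)
print(items_con)
```

If memory is tight, saving the .gz file to disk first and passing the local path to read_xml may be the safer option, since that path goes through the automatic decompression described above.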