r, xml, xml2

R xml2 error: Start tag expected, '<' not found [4]


I am trying to import an XML file from a URL:

library(xml2)

x <- read_xml('https://ftp.ncbi.nlm.nih.gov/pub/GTR/data/gtr_ftp.xml.gz')
Error in read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  Start tag expected, '<' not found [4]

According to the documentation, I should be able to pass a URL for a .gz file and it will be uncompressed. If I download the file, unzip it locally, and then use read_xml, it works fine. This is a pretty large file (~2 GB unzipped), so I am not sure whether that is a problem over a connection. Any thoughts on how I can read this directly from a connection?


Solution

  • The catch is that the documentation says "Local paths ending in .gz, .bz2, .xz, .zip will be automatically uncompressed" (note the word "Local"). The logic seems to be in the xml2:::path_to_connection function. URLs are not automatically uncompressed, only local files on disk.

    When given a URL, the read_xml function will use the curl package if it is installed. If you have that package, you can open the URL yourself with curl::curl and wrap that connection in gzcon to do the decompression. Assuming you have enough RAM, you could try

    url <- 'https://ftp.ncbi.nlm.nih.gov/pub/GTR/data/gtr_ftp.xml.gz'
    x <- read_xml(gzcon(curl::curl(url)))
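
To see what gzcon contributes without downloading anything, here is a minimal offline sketch. It assumes a small gzip-compressed file written to a temporary path can stand in for the remote one; the file contents and the names gz_path and raw_con are made up for illustration. A plain binary file connection plays the role of the curl connection, since both hand read_xml compressed bytes.

```r
library(xml2)

# Write a tiny gzip-compressed XML file to stand in for the remote .gz file
# (hypothetical contents, for illustration only).
gz_path <- tempfile(fileext = ".xml.gz")
out <- gzfile(gz_path, "wb")
writeLines("<root><item id='1'>example</item></root>", out)
close(out)

# Open the file as a plain binary connection, so the bytes coming out of it
# are still compressed, then wrap it in gzcon() to decompress them on read.
raw_con <- gzcon(file(gz_path, "rb"))
doc <- read_xml(raw_con)
close(raw_con)  # closing the gzcon wrapper also closes the underlying file

item_text <- xml_text(xml_find_first(doc, "//item"))
unlink(gz_path)
```

If this runs cleanly, the same wrapping should apply to a curl::curl(url) connection, with the caveat from the answer that the whole document still has to fit in memory.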