
R's XML package throws error on correct XML document


I need to parse many XML documents in R with the XML package (Duncan Temple Lang, 2013). Here is an example: http://musicbrainz.org/ws/2/release?query=%22A%20Is%20for%20Alpine%22%20AND%20artist:%22Alpine%22

If the link is pasted into a browser's address bar, an XML page is displayed. I checked the document with one of the many online validators (http://validator.w3.org), and its markup appears to be valid.

However, when I run this code:

library(XML)
url = "http://musicbrainz.org/ws/2/release?query=%22A%20Is%20for%20Alpine%22%20AND%20artist:%22Alpine%22"
data = xmlTreeParse(url, asTree = TRUE)

the following error is reported:

Blank needed here
Error: 1: Blank needed here

The error is similar to the one discussed in "Validation problem with XML declaration", but I can't see how it applies to the XML document I want to parse.

Software: R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"

Platform: x86_64-unknown-linux-gnu (64-bit)

XML package version 3.98-1.1


Solution

  • Download the document first with RCurl, then pass the downloaded text to xmlTreeParse instead of the URL; it then parses without the error:

    library(RCurl)
    u <- getURL(url)
    
    > xmlTreeParse(u, asTree=TRUE)
    $doc
    $file
    [1] "<buffer>"
    
    $version
    [1] "1.0"
    
    $children
    $children$metadata
    <metadata created="2013-12-17T04:49:41.807Z" xmlns="http://musicbrainz.org/ns/mmd-2.0#" xmlns:ext="http://musicbrainz.org/ns/ext#-2.0">
     <release-list count="1" offset="0">
      <release id="d1e75e7b-fe4a-4cd6-b0d9-8ccf04a62406" score="100">
       <title>A Is for Alpine by Alpine</title>
       <status>Official</status>
       <text-representation>
        <language>eng</language>
        <script>Latn</script>
       </text-representation>
       <artist-credit>
        <name-credit>
         <artist id="d7f0c2fe-00fb-4248-995a-dbfd5a87331a">
          <name>Alpine</name>
          <sort-name>Alpine</sort-name>
         </artist>
        </name-credit>
       </artist-credit>
       <release-group id="7ea67d40-8819-4059-a9be-e1115cdf0ddb" type="Album">
        <primary-type>Album</primary-type>
       </release-group>
       <date>2012-08-10</date>
       <country>AU</country>
       <release-event-list>
        <release-event>
         <date>2012-08-10</date>
         <area id="106e0bec-b638-3b37-b731-f53d507dc00e">
          <name>Australia</name>
          <sort-name>Australia</sort-name>
          <iso-3166-1-code-list>
           <iso-3166-1-code>AU</iso-3166-1-code>
          </iso-3166-1-code-list>
         </area>
        </release-event>
       </release-event-list>
       <label-info-list>
        <label-info>
         <catalog-number>IVY166</catalog-number>
         <label id="96e57a7b-c481-41e5-a0d4-111604210207">
          <name>Ivy League Records</name>
         </label>
        </label-info>
       </label-info-list>
       <medium-list count="1">
        <track-count>12</track-count>
        <medium>
         <format>CD</format>
         <disc-list count="1"/>
         <track-list count="12"/>
        </medium>
       </medium-list>
      </release>
     </release-list>
    </metadata>
    
    
    attr(,"class")
    [1] "XMLDocumentContent"
    
    $dtd
    $external
    NULL
    
    $internal
    NULL
    
    attr(,"class")
    [1] "DTDList"
    
    attr(,"class")
    [1] "XMLDocument"         "XMLAbstractDocument"
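
The download-then-parse approach above could be wrapped up roughly as follows. This is only a sketch: the helper names parse_xml_string and fetch_and_parse and the inline sample document are made up for illustration, and passing asText = TRUE is an assumption added here so that the parser always treats its argument as XML content rather than a file name or URL.

```r
library(XML)

# Parse XML that is already in memory as a character string.
# asText = TRUE makes xmlTreeParse treat the input as content, not a path or URL.
# useInternalNodes = TRUE returns a document that can be queried with XPath.
parse_xml_string <- function(txt) {
  xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
}

# Hypothetical helper: download with RCurl first, then parse the response body.
fetch_and_parse <- function(url) {
  txt <- RCurl::getURL(url)
  parse_xml_string(txt)
}

# Offline check with a small made-up document that uses the same default
# namespace as the MusicBrainz response shown above.
sample_xml <- paste0(
  '<?xml version="1.0"?>',
  '<metadata xmlns="http://musicbrainz.org/ns/mmd-2.0#">',
  '<release-list count="1"><release>',
  '<title>A Is for Alpine by Alpine</title>',
  '</release></release-list></metadata>'
)
doc <- parse_xml_string(sample_xml)

# The default namespace has to be bound to a prefix before XPath can match.
ns <- c(m = "http://musicbrainz.org/ns/mmd-2.0#")
titles <- xpathSApply(doc, "//m:title", xmlValue, namespaces = ns)
print(titles)
```

With the URL from the question, the call would be fetch_and_parse(url). Because the response declares a default namespace, an XPath query such as //title without a prefix would not be expected to match anything.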