
Parse CDATA fields in xmlToList


I'm having trouble with xmlToList, specifically several CDATA fields in an API response.

I'm working with an API that returns either XML or JSON. I'm using XML::xmlToList to translate the XML-formatted API response into a list structure and RJSONIO's fromJSON to do the same with the JSON format.

The fromJSON output is exactly what I want, and I'd like to get the same structure from the XML response.

The main issue is that xmlToList seems to discard the contents of fields if they're inside a CDATA wrapper.

Here's an example URL for the API (in XML): http://www.colourlovers.com/api/color/6B4106

And here's one in JSON: http://www.colourlovers.com/api/color/6B4106?format=json

As you can see in the first link, there are several fields with values stored in CDATA, like title.

<title>
<![CDATA[ wet dirt ]]>
</title>

If I parse this with fromJSON, I get the following:

List of 17
 $ id         : num 903893
 $ title      : chr "wet dirt"
 $ userName   : chr "jessicabrown"
 $ numViews   : num 323
 $ numVotes   : num 1
 $ numComments: num 0
 $ numHearts  : num 0
 $ rank       : num 0
 $ dateCreated: chr "2008-03-17 11:22:21"
 $ hex        : chr "6B4106"
 $ rgb        :List of 3
  ..$ red  : num 107
  ..$ green: num 65
  ..$ blue : num 6
 $ hsv        :List of 3
  ..$ hue       : num 35
  ..$ saturation: num 94
  ..$ value     : num 42
 $ description: chr ""
 $ url        : chr "http://www.colourlovers.com/color/6B4106/wet_dirt"
 $ imageUrl   : chr "http://www.colourlovers.com/img/6B4106/100/100/wet_dirt.png"
 $ badgeUrl   : chr "http://www.colourlovers.com/images/badges/c/903/903893_wet_dirt.png"
 $ apiUrl     : chr "http://www.colourlovers.com/api/color/6B4106"

The title field is just a character string, as desired. But using xmlToList, I get:

List of 17
 $ id         : chr "903893"
 $ title      :List of 1
  ..$ : NULL
 $ userName   :List of 1
  ..$ : NULL
 $ numViews   : chr "323"
 $ numVotes   : chr "1"
 $ numComments: chr "0"
 $ numHearts  : chr "0"
 $ rank       : chr "0"
 $ dateCreated: chr "2008-03-17 11:22:21"
 $ hex        : chr "6B4106"
 $ rgb        :List of 3
  ..$ red  : chr "107"
  ..$ green: chr "65"
  ..$ blue : chr "6"
 $ hsv        :List of 3
  ..$ hue       : chr "35"
  ..$ saturation: chr "94"
  ..$ value     : chr "42"
 $ description:List of 1
  ..$ : NULL
 $ url        :List of 1
  ..$ : NULL
 $ imageUrl   :List of 1
  ..$ : NULL
 $ badgeUrl   :List of 1
  ..$ : NULL
 $ apiUrl     : chr "http://www.colourlovers.com/api/color/6B4106"

Instead of returning either <![CDATA[ wet dirt ]]> or wet dirt, as I would expect, I just get a single-element list with NULL contents. How can I get xmlToList to handle the CDATA elements?

Here's the code:

xmlurl <- url('http://www.colourlovers.com/api/color/6B4106')
response1 <- paste(readLines(xmlurl, warn=FALSE), collapse='')
close(xmlurl)

jsonurl <- url('http://www.colourlovers.com/api/color/6B4106?format=json')
response2 <- paste(readLines(jsonurl, warn=FALSE), collapse='')
close(jsonurl)

str(XML::xmlToList(response1))
str(RJSONIO::fromJSON(response2))

Solution

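  • Have a look at XML:::parserOptions. The NOCDATA option tells the parser to merge CDATA sections into ordinary text nodes, so xmlToList can see their contents.

    Use

    library(XML)
    test <- xmlParse("http://www.colourlovers.com/api/color/6B4106", options = NOCDATA)
    res <- xmlToList(test)

    > res$color$title
    [1] "wet dirt"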
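To try the same idea without hitting the API, here is a minimal sketch that parses an inline string instead of the URL. The sample XML is an assumption: a hand-trimmed imitation of the response with only two fields, not the real payload. It parses the string twice, once with default options and once with NOCDATA, so the two list structures can be compared with str().

```r
library(XML)

# Hypothetical, trimmed-down stand-in for the API response (two fields only)
xml_text <- paste0(
  "<colors><color>",
  "<id>903893</id>",
  "<title><![CDATA[wet dirt]]></title>",
  "</color></colors>"
)

# asText = TRUE: treat the argument as XML content rather than a file name or URL
doc_default <- xmlParse(xml_text, asText = TRUE)
doc_nocdata <- xmlParse(xml_text, asText = TRUE, options = NOCDATA)

# Convert both documents to lists
res_default <- xmlToList(doc_default)
res_nocdata <- xmlToList(doc_nocdata)

# Compare how the title field comes out in each case
str(res_default)
str(res_nocdata)
```

With NOCDATA, res_nocdata$color$title should come back as a plain character string, as in the answer above. Note that the numeric fields are still returned as character strings by xmlToList, so matching the fromJSON structure exactly would need a separate conversion step.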