
How do I modify the top XML node in R?


I would like to add an attribute to the top-level (root) node of an XML file and then save the file. I've tried every combination of XPath and subsetting I can think of, but I can't make it work. A simple example:

library(XML)

xml_string = c(
 '<?xml version="1.0" encoding="UTF-8"?>',
 '<retrieval-response status = "found">',
      '<coredata>',
           '<id type = "author" >12345</id>',
      '</coredata>',
      '<author>',
           '<first>John</first>',
           '<last>Doe</last>',
      '</author>',
 '</retrieval-response>')

# parse xml content
xml = xmlParse(xml_string)

When I try

xmlAttrs(xml["/retrieval-response"][[1]]) <- c(id = 12345)

I get an error:

object of type 'externalptr' is not subsettable

However, the attribute is inserted, so I'm not sure what I'm doing wrong.

(More background: this is a simplified version of the data from the Scopus API. I am combining thousands of XML files with a similar structure. The id is in the "coredata" node, which is a sibling of the "author" node that contains all of the data, so when I use SAS to compile the combined XML document into datasets, there is no link between the id and the data. I'm hoping that adding the id to the top of the hierarchy will cause it to propagate down to all of the other levels.)


Solution

  • Edit: After trying the approach of editing the top node (see Old Answer below), I realized that editing the top node doesn't solve my problem because the SAS XML mapper did not retain all of the ids.

    I tried a new approach, adding the author id to each of the subnodes, which worked perfectly. I also learned that you can select several sets of nodes at once by passing a character vector of XPath expressions, like this:

    c("//coredata",
      "//affiliation-current",
      "//affiliation-history",
      "//subject-areas",
      "//author-profile")
    

    So the final program I used was:

    files <- list.files()
    
    for (i in seq_along(files)) {
         author_record <- xmlParse(files[i])
    
         # add the author id as an attribute to every matched node
         xpathApply(
              author_record, c(
                   "//coredata",
                   "//affiliation-current",
                   "//affiliation-history",
                   "//subject-areas",
                   "//author-profile"
              ),
              addAttributes,
              auth_id = gsub("AUTHOR_ID:", "", xmlValue(author_record[["//dc:identifier"]]))
         )
    
         # overwrite the original file with the modified document
         saveXML(author_record, file = files[i])
    }
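
The vector-of-paths idea can also be written as an explicit XPath union (`|`), which makes it clear that all matching nodes end up in a single node set. A minimal sketch on an in-memory document, assuming the XML package; the element names mirror the ones above, while `paths`, `union_path`, and the hard-coded `auth_id` value are illustrative only:

```r
library(XML)

# Small stand-in for one author record (structure is illustrative).
doc <- xmlParse(
  '<author-retrieval-response>
     <coredata><id>12345</id></coredata>
     <affiliation-history/>
     <subject-areas/>
     <author-profile><first>John</first></author-profile>
   </author-retrieval-response>',
  asText = TRUE
)

# Join the individual expressions into one XPath union expression.
paths <- c("//coredata", "//affiliation-history",
           "//subject-areas", "//author-profile")
union_path <- paste(paths, collapse = " | ")

# Apply addAttributes() to every matched node; nodes are modified in place.
invisible(xpathApply(doc, union_path, addAttributes, auth_id = "12345"))

# Collect the names of all elements that now carry the attribute.
tagged <- xpathSApply(doc, "//*[@auth_id='12345']", xmlName)
```

Each expression starts with `//` so that it matches the element anywhere in the document; a bare name such as `subject-areas` is evaluated relative to the document node and would only match a root element of that name.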
    

    Old Answer: After much experimentation I found a rather simple solution to my problem.

    Attributes can be added to the top node by simply using

    addAttributes(xmlRoot(xmlfile), attribute = "attributeValue") 
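
This also seems to explain the error in the question. With `xmlAttrs(xml[...][[1]]) <- value`, R first updates the node (nodes in a parsed document are references, so the change sticks) and then tries to assign the result back into `xml` via `[<-`, which an external pointer does not support. Working on the node directly avoids that second step. A minimal sketch, assuming the XML package and a cut-down version of the sample document; `doc`, `root`, `root_id`, and `root_status` are illustrative names:

```r
library(XML)

# Cut-down version of the sample document from the question.
doc <- xmlParse(
  '<retrieval-response status="found"><coredata><id type="author">12345</id></coredata></retrieval-response>',
  asText = TRUE
)

# Take the root node once; internal nodes are references into the document,
# so modifying `root` modifies `doc` as well.
root <- xmlRoot(doc)

# Add the attribute in place; nothing is assigned back into `doc`.
invisible(addAttributes(root, id = "12345"))

# Read the attributes back from the document to confirm the change.
root_id <- xmlGetAttr(xmlRoot(doc), "id")
root_status <- xmlGetAttr(xmlRoot(doc), "status")
```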
    

    For my specific case, the most straightforward solution will be a simple loop:

    setwd("C:/directory/with/individual/xmlfiles")
    
    files <- list.files()
    
    for (i in seq_along(files)) {
    
         author_record <- xmlParse(files[i])
    
         # strip the "AUTHOR_ID:" prefix and store the id on the root node
         addAttributes(node = xmlRoot(author_record),
                       id   = gsub(pattern = "AUTHOR_ID:",
                                   replacement = "",
                                   x = xmlValue(author_record[["//dc:identifier"]])
                       )
         )
    
         saveXML(author_record, file = files[i])
    }
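
One thing to watch in a loop like this is that `list.files()` without a pattern returns every file in the working directory, not only the XML files. A hedged sketch of the same idea as a function that takes the directory as an argument and filters on the `.xml` extension, assuming the XML package; `add_root_id` and `dc_ns` are hypothetical names, and the Dublin Core URI is an assumption about how the `dc` prefix is declared in the files:

```r
library(XML)

# Assumed namespace URI for the dc prefix (Dublin Core elements).
dc_ns <- c(dc = "http://purl.org/dc/elements/1.1/")

# Add the author id to the root node of every .xml file in `dir`.
# Assumes each file contains at least one dc:identifier element.
add_root_id <- function(dir) {
  files <- list.files(dir, pattern = "\\.xml$", full.names = TRUE)
  for (f in files) {
    doc <- xmlParse(f)
    # First dc:identifier value, e.g. "AUTHOR_ID:12345".
    raw_id <- xpathSApply(doc, "//dc:identifier", xmlValue,
                          namespaces = dc_ns)[1]
    addAttributes(xmlRoot(doc),
                  id = sub("AUTHOR_ID:", "", raw_id, fixed = TRUE))
    # Overwrite the original file with the modified document.
    saveXML(doc, file = f)
  }
  invisible(files)
}
```

It would be called as `add_root_id("C:/directory/with/individual/xmlfiles")`, which also removes the need for `setwd()`.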
    

    I'm sure there are better ways. Clearly I need to learn XSLT; that was a very powerful approach!