
How to extract the second attribute from an XML file line in R


I need to extract certain attributes from an XML file in which several nodes share the same name but carry a different number of attributes. The file is located here:

https://boardgamegeek.com//xmlapi//boardgame//13&type=boardgame,boardgameexpansion,boardgameaccesory,rpgitem,rpgissue,videogame&versions=1&stats=1&videos=1&marketplace=1&comments=1&pricehistory=1

And here is a small portion of the file itself:

<boardgames termsofuse="https://boardgamegeek.com/xmlapi/termsofuse">
  <boardgame objectid="13">
     <yearpublished>1995</yearpublished>
     <minplayers>3</minplayers>
     <maxplayers>4</maxplayers>
     <playingtime>120</playingtime>
     <minplaytime>60</minplaytime>
     <maxplaytime>120</maxplaytime>
     <age>10</age>
     <name sortindex="1">Catan</name>
     <name primary="true" sortindex="1">CATAN</name>
     <name sortindex="1">Catan (Колонизаторы)</name>
     <name sortindex="1">Catan telepesei</name>
     <name sortindex="1">Catan: Das Spiel</name>
     <name sortindex="1">Catan: Die Bordspel</name>
     <name sortindex="1">Catan: El Juego</name>
     <name sortindex="1">Catan: Gra planszowa</name>
     <name sortindex="1">Catan: Il Gioco</name>
     <name sortindex="1">Catan: Landnemarnir</name>

I want to extract only the value of "sortindex" from each node named "name". I have tried the following, but for the second "name" node it returns both the primary value ("true") and the sortindex value. I have tried many different approaches, including xmlGetAttr, and I can't get it to work. How do I get this simple operation to work?

library(xml2)
library(XML)

data <- read_xml(url)  # url holds the address given above
xmlfile <- xmlParse(data)
xmltop <- xmlRoot(xmlfile)
xmlSApply(getNodeSet(xmltop, '//name[@sortindex]'), xmlAttrs)

> xmlSApply(getNodeSet(xmltop, '//name[@primary]'), xmlAttrs)
             [,1]  
  primary   "true"
  sortindex "1"   

Solution

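  • xmlAttrs returns every attribute of a node, which is why both primary and sortindex come back for the second name node. It sounds like you want one value per name node, even if the node doesn't have the attribute. If so, you can try the following: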

    library(xml2)
    library(XML)

    data <- read_xml('https://boardgamegeek.com//xmlapi//boardgame//13&type=boardgame,boardgameexpansion,boardgameaccesory,rpgitem,rpgissue,videogame&versions=1&stats=1&videos=1&marketplace=1&comments=1&pricehistory=1')
    xmlfile <- xmlParse(data)
    xmltop <- xmlRoot(xmlfile)

    # Return the named attribute of a node, or NA if the node lacks it
    getAttr <- function(x, attrName) {
      attrs <- xmlAttrs(x)
      if (attrName %in% names(attrs)) {
        attrs[[attrName]]
      } else {
        NA_character_
      }
    }

    # One value per name node; NA where the attribute is missing
    xmlSApply(getNodeSet(xmltop, '//name'), function(x) getAttr(x, "sortindex"))

    xmlSApply(getNodeSet(xmltop, '//name'), function(x) getAttr(x, "primary"))
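
The hand-written getAttr helper can probably be replaced by xmlGetAttr from the XML package, which accepts a default value for nodes that lack the attribute. The sketch below is not part of the original answer. It assumes a small inline XML string (a cut-down copy of the sample from the question, not the live feed) so that it runs without network access, and unname() is there only to strip any attribute names that sapply may carry over.

```r
library(XML)

# Cut-down copy of the sample document (assumption: inline text instead of the URL).
xml_text <- '<boardgames>
  <boardgame objectid="13">
    <name sortindex="1">Catan</name>
    <name primary="true" sortindex="1">CATAN</name>
    <name sortindex="1">Catan: Das Spiel</name>
  </boardgame>
</boardgames>'

doc <- xmlParse(xml_text, asText = TRUE)
name_nodes <- getNodeSet(doc, "//name")

# xmlGetAttr(node, name, default) returns the default when the attribute is absent.
sortindex <- unname(sapply(name_nodes, xmlGetAttr, "sortindex", NA_character_))
primary   <- unname(sapply(name_nodes, xmlGetAttr, "primary", NA_character_))
```

Here sortindex holds one value per name node, and primary is NA for every node except the one that carries the attribute.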
    

    If you don't want to include nodes without the attribute, then you can do something very similar:

    library(xml2)
    library(XML)

    data <- read_xml('https://boardgamegeek.com//xmlapi//boardgame//13&type=boardgame,boardgameexpansion,boardgameaccesory,rpgitem,rpgissue,videogame&versions=1&stats=1&videos=1&marketplace=1&comments=1&pricehistory=1')
    xmlfile <- xmlParse(data)
    xmltop <- xmlRoot(xmlfile)

    # The XPath predicate [@attr] keeps only the name nodes that carry the attribute
    xmlSApply(getNodeSet(xmltop, '//name[@sortindex]'), function(x) xmlAttrs(x)[['sortindex']])

    xmlSApply(getNodeSet(xmltop, '//name[@primary]'), function(x) xmlAttrs(x)[['primary']])
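
Since the document is already read with read_xml, both variants can likely be written with xml2 alone, without converting to the XML package. This is a sketch rather than part of the original answer. It assumes a small inline copy of the sample from the question so that it runs offline; with the real feed, the URL would be passed to read_xml instead.

```r
library(xml2)

# Cut-down copy of the sample document (assumption: inline text instead of the URL).
doc <- read_xml('<boardgames>
  <boardgame objectid="13">
    <name sortindex="1">Catan</name>
    <name primary="true" sortindex="1">CATAN</name>
    <name sortindex="1">Catan: Das Spiel</name>
  </boardgame>
</boardgames>')

# Every name node: xml_attr() yields NA where the attribute is missing.
name_nodes    <- xml_find_all(doc, "//name")
sortindex_all <- xml_attr(name_nodes, "sortindex")
primary_all   <- xml_attr(name_nodes, "primary")

# Only the name nodes that carry the attribute, selected by the XPath predicate.
primary_only <- xml_attr(xml_find_all(doc, "//name[@primary]"), "primary")
```

xml_attr is vectorised over the node set, so no apply loop or helper function is needed.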