I have the following information as a nested XML file that I'm trying to turn into a data.frame for analysis and reporting:
<node TEXT="Cost">
<node TEXT="Scale">
<node TEXT="1 - $0 to $100">
</node>
<node TEXT="2 - $100 to $500">
</node>
<node TEXT="3 - $500 to $1000">
</node>
<node TEXT="4 - $1000 to $5000">
</node>
<node TEXT="6 - $5000 +">
</node>
</node>
<node TEXT="Weight">
<node TEXT="1">
</node>
</node>
</node>
I'm able to read in the XML file and extract small portions, as shown below:
file <- '<node TEXT="Cost">
<node TEXT="Scale">
<node TEXT="1 - $0 to $100">
</node>
<node TEXT="2 - $100 to $500">
</node>
<node TEXT="3 - $500 to $1000">
</node>
<node TEXT="4 - $1000 to $5000">
</node>
<node TEXT="6 - $5000 +">
</node>
</node>
<node TEXT="Weight">
<node TEXT="1">
</node>
</node>
</node>
'
data <- read_xml(file)
xml_find_all(data,"//node/node[@TEXT = 'Scale']/node/@TEXT")
But what I really need to do is get it in the form of a data.frame like the following:
Node1 Node2 Node3
"Cost" "Scale" "1 - $0 to $100"
"Cost" "Scale" "2 - $100 to $500"
"Cost" "Scale" "3 - $500 to $1000"
"Cost" "Scale" "4 - $1000 to $5000"
"Cost" "Scale" "5 - $5000 +"
"Cost" "Weight" "1"
Can someone point me in the right direction?
Rather than using xslt
, you can also just iterate over the node list. Here we choose all the nodes three deep and then extract the TEXT attribute from all the parents (and finally bind it together with dplyr
)
library(dplyr)
xml_find_all(doc,"//node/node/node") %>% lapply(function(x) {
list(
NODE1=x %>% xml_parent %>% xml_parent %>% xml_attr("TEXT"),
NODE2=x %>% xml_parent %>% xml_attr("TEXT"),
NODE3=x %>% xml_attr("TEXT")
)
}) %>% bind_rows()