Search code examples
rxmlxml2

R: Extract XML node values into Data Frame from deeply nested XML file


I am trying to extract the list of ingredients from a xml FDA label in their database. Somehow I can't get the getNodeSet function to produce the appropriate list of nodes.

Question: How do I fix the code to be able to extract the values of the ingredient names.

This is the code that doesn't work, which breaks at the getNodeSet function, giving a 0 length list. I tried the [graciously proposed] solution on this answer (flatten xml), and many others with no luck.

library(XML)
URL <- "http://www.accessdata.fda.gov/spl/data/e9b9a189-d9e3-42e3-b41c-846a94ebeb37/e9b9a189-d9e3-42e3-b41c-846a94ebeb37.xml"
xmlDocu <- xmlParse(URL)
xmlIngredients <- getNodeSet(xmlDocu, "//ingredient/ingredientSubstance/name")
Ingredients <- xmlSApply(xmlIngredients, function(x) xmlSApply(x, xmlValue))
dfIngredients <- data.frame(t(Ingredients),row.names=NULL)

This is what a portion interest from the deeply nested xml file looks like:

              <ingredient classCode="ACTIB">
                   <quantity>
                      <numerator value="160" unit="mg"/>
                      <denominator value="5" unit="mL"/>
                   </quantity>
                   <ingredientSubstance>
                      <code code="362O9ITL9D" codeSystem="2.16.840.1.113883.4.9"/>
                      <name>Acetaminophen</name>
                      <activeMoiety>
                         <activeMoiety>
                            <code code="362O9ITL9D" codeSystem="2.16.840.1.113883.4.9"/>
                            <name>acetaminophen</name>
                         </activeMoiety>
                      </activeMoiety>
                   </ingredientSubstance>
                </ingredient>
                <ingredient classCode="IACT">
                   <ingredientSubstance>
                      <code code="3QPI1U3FV8" codeSystem="2.16.840.1.113883.4.9"/>
                      <name>BUTYLPARABEN</name>
                   </ingredientSubstance>
                </ingredient>

Solution

  • I believe your problem is related to the namespaces defined in the xml file. The are a couple ways of dealing with this problem. What I prefer is to use xml2 package and strip out the namespaces and then parse the file:

    library(xml2)
    
    URL <- "http://www.accessdata.fda.gov/spl/data/e9b9a189-d9e3-42e3-b41c-846a94ebeb37/e9b9a189-d9e3-42e3-b41c-846a94ebeb37.xml"
    xmlDocu <- read_xml(URL)
    #find namespace
    ns<-xml_ns(xmlDocu)
    
    #I find it easier to strip the name space and use accordingly
    xml_ns_strip(xmlDocu)
    xml_find_all(xmlDocu, "//ingredient")
    xml_text(xml_find_all(xmlDocu, "//ingredient/ingredientSubstance/name"))
    

    I find the syntax of the rvest and xml2 easier to use then the XML package.