r xml dataframe

How can I convert hierarchical node data in R into a dataframe?


I have the following XML file.

<?xml version="1.0" encoding="UTF-8"?>

<gudid xmlns="http://www.fda.gov/cdrh/gudid" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.0" xsi:schemaLocation="http://www.fda.gov/cdrh/gudid gudid.xsd">

  <header>
  <database frequency="monthly" id="5460" type="FULL">
  <downloadFile part="1" totalParts="174"/>
  <numberRecordXML>25000</numberRecordXML>
  <numberRecordsDatabase>4334252</numberRecordsDatabase>
  </database>
  <creationDate>2024-04-01T03:30:00</creationDate>
  <period end="2024-04-01T03:30:00" start="2014-09-24T00:00:00"/>
  </header>

  <device xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.fda.gov/cdrh/gudid">
    <publicDeviceRecordKey>33db3dc9-1c5e-4670-a3e1-b52ae0e8c7f0</publicDeviceRecordKey>
    <publicVersionStatus>Update</publicVersionStatus>
    <deviceRecordStatus>Published</deviceRecordStatus>
    <!--Note: publicVersionNumber is system generated and assigned to all device records to track public release updates to a device record-->
    <publicVersionNumber>3</publicVersionNumber>
    <publicVersionDate>2018-07-06</publicVersionDate>
    <devicePublishDate>2016-09-01</devicePublishDate>
    <deviceCommDistributionEndDate xsi:nil="true"></deviceCommDistributionEndDate>
    <deviceCommDistributionStatus>In Commercial Distribution</deviceCommDistributionStatus>
    <identifiers>
      <identifier>
        <deviceId>M991NM82330A243</deviceId>
        <deviceIdType>Package</deviceIdType>
        <deviceIdIssuingAgency>HIBCC</deviceIdIssuingAgency>
        <containsDINumber>M991NM82330A242</containsDINumber>
        <pkgQuantity>4</pkgQuantity>
        <pkgDiscontinueDate xsi:nil="true"></pkgDiscontinueDate>
        <pkgStatus>In Commercial Distribution</pkgStatus>
        <pkgType>Case</pkgType>
      </identifier>
      <identifier>
        <deviceId>M991NM82330A242</deviceId>
        <deviceIdType>Package</deviceIdType>
        <deviceIdIssuingAgency>HIBCC</deviceIdIssuingAgency>
        <containsDINumber>M991NM82330A241</containsDINumber>
        <pkgQuantity>8</pkgQuantity>
        <pkgDiscontinueDate xsi:nil="true"></pkgDiscontinueDate>
        <pkgStatus>In Commercial Distribution</pkgStatus>
        <pkgType>Box</pkgType>
      </identifier>
      <identifier>
        <deviceId>M991NM82330A241</deviceId>
        <deviceIdType>Primary</deviceIdType>
        <deviceIdIssuingAgency>HIBCC</deviceIdIssuingAgency>
        <containsDINumber xsi:nil="true"></containsDINumber>
        <pkgQuantity xsi:nil="true"></pkgQuantity>
        <pkgDiscontinueDate xsi:nil="true"></pkgDiscontinueDate>
        <pkgStatus xsi:nil="true"></pkgStatus>
        <pkgType xsi:nil="true"></pkgType>
      </identifier>
    </identifiers>
    <brandName>Clear-View MAX &quot;Sub-Q&quot; Infusion Set</brandName>
    <versionModelNumber>ClearView™MAX</versionModelNumber>
    <catalogNumber>NM82330A-24</catalogNumber>
    <dunsNumber>013861471</dunsNumber>
    <companyName>NORFOLK MEDICAL</companyName>
    <deviceCount>1</deviceCount>
    <deviceDescription>24G x 12mm x 24&quot; Clear-View MAX &quot;Sub-Q&quot; Infusion Set</deviceDescription>
    <DMExempt>true</DMExempt>
    <premarketExempt>false</premarketExempt>
    <deviceHCTP>false</deviceHCTP>
    <deviceKit>false</deviceKit>
    <deviceCombinationProduct>false</deviceCombinationProduct>
    <singleUse>true</singleUse>
    <lotBatch>true</lotBatch>
    <serialNumber>false</serialNumber>
    <manufacturingDate>true</manufacturingDate>
    <expirationDate>true</expirationDate>
    <donationIdNumber>false</donationIdNumber>
    <labeledContainsNRL>false</labeledContainsNRL>
    <labeledNoNRL>true</labeledNoNRL>
    <MRISafetyStatus>Labeling does not contain MRI Safety Information</MRISafetyStatus>
    <rx>true</rx>
    <otc>false</otc>
    <contacts>
      <customerContact>
        <phone>+1(847)674-7075</phone>
        <phoneExtension>102</phoneExtension>
        <email>[email protected]</email>
      </customerContact>
    </contacts>
    <premarketSubmissions>
      <premarketSubmission>
        <submissionNumber>K870188</submissionNumber>
        <supplementNumber>000</supplementNumber>
      </premarketSubmission>
    </premarketSubmissions>
    <gmdnTerms>
      <gmdn>
        <gmdnCode>35833</gmdnCode>
        <gmdnPTName>Electric infusion pump administration set, single-use</gmdnPTName>
        <gmdnPTDefinition>A collection of sterile devices (e.g., plastic tubing, check valve, roller clamp, Y-site connector, Luer, needle/catheter) intended to be used in combination with an electrically-powered infusion pump for the intravenous (IV), subcutaneous, intramuscular, or epidural administration of medication. This is a single-use device.</gmdnPTDefinition>
        <implantable>false</implantable>
        <gmdnCodeStatus>Active</gmdnCodeStatus>
      </gmdn>
    </gmdnTerms>
    <productCodes>
      <fdaProductCode>
        <productCode>FPA</productCode>
        <productCodeName>Set, administration, intravascular</productCodeName>
      </fdaProductCode>
    </productCodes>
    <deviceSizes>
      <deviceSize>
        <sizeType>Needle Gauge</sizeType>
        <size value="24" unit="Gauge"/>
        <sizeText xsi:nil="true"></sizeText>
      </deviceSize>
    </deviceSizes>
    <environmentalConditions/>
    <sterilization>
      <deviceSterile>true</deviceSterile>
      <sterilizationPriorToUse>true</sterilizationPriorToUse>
      <methodTypes>
        <sterilizationMethod>Ethylene Oxide</sterilizationMethod>
      </methodTypes>
    </sterilization>
  </device>

</gudid>

I want to convert it into a data frame. The identifiers element is nested and contains three identifier children. With the code I've written (shown next), all of the identifiers are collapsed into a single cell of the identifiers column, so the individual values cannot be told apart.

library(XML)

setwd('D:/')

# parse the file with the XML package, then flatten the children of the root node
xml <- xmlParse('xmltest.xml')

df <- xmlToDataFrame(xml)

Additionally, I want to convert only the device node into a dataframe, leaving the header node untouched.


Solution

  • Here is a roundabout method that uses the jsonlite package to convert the list into a data frame.

    library(xml2)
    library(dplyr)
    library(jsonlite)
    page <- read_xml('xmltest.xml')
    
    # strip the default namespace so plain XPath names match (modifies page in place)
    xml_ns_strip(page)
    
    # find the identifier nodes and convert them to a list
    listings <- xml_find_all(page, ".//identifier") %>% as_list()
    
    # then use the jsonlite package to convert the list into a data frame
    listings %>% jsonlite::toJSON() %>% jsonlite::fromJSON(simplifyDataFrame = TRUE)
    
    # or use dplyr::bind_rows(), which gives one row per identifier
    bind_rows(lapply(listings, unlist))
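
The question also asks for only the device node to be converted and for each identifier to stay distinguishable. A minimal sketch of one way to do that with xml2 alone follows. It assumes a trimmed-down stand-in for xmltest.xml embedded as a string, and the helper `node_text()` and the names `device_df` and `identifier_df` are hypothetical, introduced only for illustration. The idea is to build one table from the leaf children of each device (so the header is never touched) and a second table with one row per identifier, keyed by publicDeviceRecordKey.

```r
library(xml2)

# Trimmed-down stand-in for xmltest.xml (assumption: same layout, fewer fields)
xml_txt <- '<gudid xmlns="http://www.fda.gov/cdrh/gudid"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <header><creationDate>2024-04-01T03:30:00</creationDate></header>
  <device>
    <publicDeviceRecordKey>33db3dc9-1c5e-4670-a3e1-b52ae0e8c7f0</publicDeviceRecordKey>
    <identifiers>
      <identifier>
        <deviceId>M991NM82330A243</deviceId>
        <deviceIdType>Package</deviceIdType>
        <pkgQuantity>4</pkgQuantity>
        <pkgType>Case</pkgType>
      </identifier>
      <identifier>
        <deviceId>M991NM82330A241</deviceId>
        <deviceIdType>Primary</deviceIdType>
        <pkgQuantity xsi:nil="true"></pkgQuantity>
        <pkgType xsi:nil="true"></pkgType>
      </identifier>
    </identifiers>
    <companyName>NORFOLK MEDICAL</companyName>
  </device>
</gudid>'

page <- read_xml(xml_txt)
xml_ns_strip(page)  # drop the default namespace so plain XPath names match

# Hypothetical helper: text of the first XPath match per node,
# with missing or empty (xsi:nil) elements returned as NA
node_text <- function(nodes, xpath) {
  txt <- xml_text(xml_find_first(nodes, xpath))
  ifelse(is.na(txt) | txt == "", NA_character_, txt)
}

# Device-level table: only children of <device> that have no child elements,
# so nested blocks such as <identifiers> and the <header> are skipped
devices <- xml_find_all(page, ".//device")
device_rows <- lapply(devices, function(d) {
  leaves <- xml_find_all(d, "./*[not(*)]")
  values <- xml_text(leaves)
  values[values == ""] <- NA_character_
  as.data.frame(as.list(setNames(values, xml_name(leaves))),
                stringsAsFactors = FALSE)
})
device_df <- do.call(rbind, device_rows)

# Identifier-level table: one row per <identifier>, keyed by its parent device
ids <- xml_find_all(page, ".//identifier")
fields <- c("deviceId", "deviceIdType", "pkgQuantity", "pkgType")
cols <- lapply(fields, function(f) node_text(ids, paste0("./", f)))
names(cols) <- fields
identifier_df <- data.frame(
  publicDeviceRecordKey = node_text(ids, "ancestor::device/publicDeviceRecordKey"),
  cols,
  stringsAsFactors = FALSE
)

print(device_df)
print(identifier_df)
```

The two tables can then be joined on publicDeviceRecordKey if a single wide table is needed. On a full file with many devices, `do.call(rbind, ...)` will fail if devices expose different sets of leaf elements; in that case `dplyr::bind_rows(device_rows)` is a likely substitute, since it fills missing columns with NA.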