Search code examples
rxmldataframexml2

Data frame from multiple XML files


I have a folder with multiple XML files. All the files have the same basic structure. However, each file actually contains data related to a single entity distributed among 16 parent nodes. Some nodes have children, some even have grandchildren, and some great-grandchildren.

I want to create a data frame from multiple files with only selected nodes/children/grandchildren.

As a first step, I read a single XML file as a list. Then ran a few lines of code to get the required data into a vector. Eventually, converted the vector to a dataframe, like I want one.

This the code:

library(xml2)
library(tidyverse)

x = as_list(read_xml("ACTRN12605000003673.xml"))

tmp = c(ACTR_Number = as.character(x$ANZCTR_Trial$actrnumber),
         primary_sponsor_type = as.character(x$ANZCTR_Trial$sponsorship$primarysponsortype),
         primary_sponsor_name = as.character(x$ANZCTR_Trial$sponsorship$primarysponsorname),
         primary_sponsor_address = as.character(x$ANZCTR_Trial$sponsorship$primarysponsoraddress),
         primary_sponsor_country = as.character(x$ANZCTR_Trial$sponsorship$primarysponsorcountry),
         funding_source_type = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingtype),
         funding_source_name = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingname),
         funding_source_address = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingaddress),
         funding_source_country = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingcountry),
         secondary_sponsor_type = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsortype),
         secondary_sponsor_name = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsorname),
         secondary_sponsor_address = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsoraddress),
         secondary_sponsor_country = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsorcountry))
tmp = as.list(tmp)
tmp = as.data.frame(tmp)

For the next step, I tried to work with 2 XML files together. I tried the following code to read two files simultaneously. However, beyond that, I don't know how to go ahead.

all_files = list.files(pattern=".xml", path = getwd(), full.names = TRUE)
x = lapply(all_files, read_xml)
class(x)

Sample files here


Solution

  • We can encapsulate your data gathering process into a function. Since each XML file seems to represent one row in your desired output dataframe, we can use purrr::map_dfr to construct the data rowwise. I slightly modified your code. See below:

    get_row_data <- function(file_name) {
      xml <- xml2::read_xml(file_name)
      read_text <- \(x, xpath) rvest::html_elements(x, xpath = xpath) |> rvest::html_text(TRUE) 
      xpaths <- c(
        ACTR_Number = "/ANZCTR_Trial/actrnumber", 
        primary_sponsor_type = "/ANZCTR_Trial/sponsorship/primarysponsortype", 
        primary_sponsor_name = "/ANZCTR_Trial/sponsorship/primarysponsorname", 
        primary_sponsor_address = "/ANZCTR_Trial/sponsorship/primarysponsoraddress", 
        primary_sponsor_country = "/ANZCTR_Trial/sponsorship/primarysponsorcountry",
        funding_source_type = "/ANZCTR_Trial/sponsorship/fundingsource/fundingtype",
        funding_source_name = "/ANZCTR_Trial/sponsorship/fundingsource/fundingname",
        funding_source_address = "/ANZCTR_Trial/sponsorship/fundingsource/fundingaddress",
        funding_source_country = "/ANZCTR_Trial/sponsorship/fundingsource/fundingcountry",
        secondary_sponsor_type = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsortype",
        secondary_sponsor_name = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsorname",
        secondary_sponsor_address = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsoraddress",
        secondary_sponsor_country = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsorcountry"
      )
      x <- read_text(xml, paste0(xpaths, collapse = " | "))
      names(x) <- names(xpaths)
      as_tibble(as.list(x))
    }
    
    all_files <- list.files(pattern = ".xml", path = getwd(), full.names = TRUE)
    purrr::map_dfr(all_files, get_row_data)
    

    Output

    # A tibble: 2 x 13
      ACTR_Number         primary_sponsor_~ primary_sponsor_n~ primary_sponsor_~ primary_sponsor~ funding_source_~ funding_source_~
      <chr>               <chr>             <chr>              <chr>             <chr>            <chr>            <chr>           
    1 ACTRN12605000003673 Hospital          Barwon Health      "272-322 Ryrie S~ Australia        Commercial sect~ Astra Zeneca    
    2 ACTRN12605000025639 Other Collaborat~ Australian Gastro~ "88 Mallett St\n~ Australia        Commercial sect~ Roche Products ~
    # ... with 6 more variables: funding_source_address <chr>, funding_source_country <chr>, secondary_sponsor_type <chr>,
    #   secondary_sponsor_name <chr>, secondary_sponsor_address <chr>, secondary_sponsor_country <chr>