Search code examples
rxmlxml2

Extract XML to dataframe in R


Through a POST request using the httr package, I get back XML in the following format:-

<ReportDelivery responsecode="0" responsetext="descriptive text">
  <Terminal isn="DCC000000001" imo="111111111" name="MV Vessel A">
    <Report>
      <DateTime>01/10/2014 15:30:45</DateTime> 
      <Lat>99.9999999</Lat> 
      <Lon>999.9999999</Lon> 
      <Cog>999</Cog> 
      <Sog>999</Sog> 
      <Voltage>99</Voltage> 
      <Status>Description of status</Status> 
    </Report>
    <Report>
      ..
    </Report>
  </Terminal>
  <Terminal isn="DCC000000002" imo="222222222" name="MV Vessel B">
   ..
  </Terminal>
</ReportDelivery>

I am able to get the "Report" part in a data frame using two functions available here:-

#Using functions from https://rud.is/rpubs/xml2power/
xtrct <- function(doc, target) { xml_find_all(doc, target) %>% xml_text() %>% trimws() }

xtrct_df <- function(doc, top) {
  xml_find_first(doc, sprintf(".//%s", top)) %>%
    xml_children() %>%
    xml_name() %>%
    map(~{
      xtrct(doc, sprintf(".//%s/%s", top, .x)) %>%
        list() %>%
        set_names(tolower(.x))
    }) %>%
    flatten_df() %>%
    readr::type_convert()
}

x <- xtrct_df(doc, "Report")

Within each Terminal node, there are multiple reports pertaining to a particular ship whose attributes are given in the Terminal node.

Currently, the columns in x are:

names(x)
[1] "datetime" "lat"      "lon"      "cog"      "sog"      "voltage"  "status" 

How can I add the name of the ship as a column to this dataframe? I can extract the name attribute using :-

xattrs <- xpathSApply(z, "//*/Terminal/@name")

BBut have no clue on how to include this as a variable in the dataframe. Would appreciate some help please.


Solution

  • Taking a somewhat different route from @hrbrmstr, we can map_df on each element, while also finding the parent and extracting the appropriate attr:

    library(xml2)
    library(purrr)
    
    col_names <- read_xml(x) %>%
      xml_find_first('.//Report') %>% 
      xml_children() %>% 
      xml_name() 
    
    read_xml(x) %>% 
      xml_find_all(".//Report") %>% 
      map_df(~{
        parent_name <- xml_parent(.x) %>% 
          xml_attr('name') %>% 
          as.character()
    
        xml_children(.x) %>% 
          as_list() %>% 
          data.frame(stringsAsFactors = FALSE) %>% 
          set_names(col_names) %>% 
          cbind(VesselName=parent_name)
      })
    
    #>              DateTime        Lat         Lon Cog Sog Voltage
    #> 1 01/10/2014 15:30:45 99.9999999 999.9999999 999 999      99
    #> 2 01/10/2014 15:30:45 99.9999999 999.9999999 999 999      99
    #> 3 01/10/2014 15:30:45 99.9999999 999.9999999 999 999      99
    #>                  Status  VesselName
    #> 1 Description of status MV Vessel A
    #> 2 Description of status MV Vessel A
    #> 3 Description of status MV Vessel B
    

    Created on 2018-05-07 by the reprex package (v0.2.0).

    Data:

    x <- '<ReportDelivery responsecode="0" responsetext="descriptive text">
      <Terminal isn="DCC000000001" imo="111111111" name="MV Vessel A">
        <Report>
          <DateTime>01/10/2014 15:30:45</DateTime> 
          <Lat>99.9999999</Lat> 
          <Lon>999.9999999</Lon> 
          <Cog>999</Cog> 
          <Sog>999</Sog> 
          <Voltage>99</Voltage> 
          <Status>Description of status</Status> 
        </Report>
        <Report>
          <DateTime>01/10/2014 15:30:45</DateTime> 
          <Lat>99.9999999</Lat> 
          <Lon>999.9999999</Lon> 
          <Cog>999</Cog> 
          <Sog>999</Sog> 
          <Voltage>99</Voltage> 
          <Status>Description of status</Status> 
        </Report>
      </Terminal>
      <Terminal isn="DCC000000002" imo="222222222" name="MV Vessel B">
        <Report>
          <DateTime>01/10/2014 15:30:45</DateTime> 
          <Lat>99.9999999</Lat> 
          <Lon>999.9999999</Lon> 
          <Cog>999</Cog> 
          <Sog>999</Sog> 
          <Voltage>99</Voltage> 
          <Status>Description of status</Status> 
        </Report>
      </Terminal>
    </ReportDelivery>'