Search code examples
rxmlxml-parsingxml2

Extract XML data in tabular format


I have an xml file which I want to extract data from. Ultimately, what I need is a table that shows the name of the node (i.e. NODE36 and NODE44) with the information in the tables (see the desired output below).

Is there a way to use regex or an XML parser to extract the data as a table?

<?xml version="1.0" encoding="UTF-8"?>
<Document>
    <name>culverts.XML</name>
    <StyleMap id="m_ylw-pushpin29">
        <Pair>
            <key>normal</key>
            <styleUrl>#s_ylw-pushpin00</styleUrl>
        </Pair>
        <Pair>
            <key>highlight</key>
            <styleUrl>#s_ylw-pushpin_hl25</styleUrl>
        </Pair>
    </StyleMap>
<Folder>
        <name>culverts.XML</name>
        <open>1</open>
        <description>Culvert</description>
        <Placemark>
            <name>NODE36</name>
            <description><![CDATA[<br><br><br>
    <table border="1" padding="0">
    <tr><td>Objectid</td><td>1</td></tr>
    <tr><td>On_route</td><td>Mid Turnpike</td></tr>
    <tr><td>Road_numbe</td><td>54</td></tr>
    <tr><td>Recommenda</td><td>Continue to monitor.</td></tr>]]></description>
            <styleUrl>#m_ylw-pushpin29</styleUrl>
            <Point>
                <extrude>1</extrude>
                <altitudeMode>relativeToGround</altitudeMode>
                <coordinates>-74.249045,45.997986,0</coordinates>
            </Point>
        </Placemark>
        <Placemark>
            <name>NODE44</name>
            <description><![CDATA[<br><br><br>
    <table border="1" padding="0">
    <tr><td>Objectid</td><td>2</td></tr>
    <tr><td>On_route</td><td>Mid Turnpike</td></tr>
    <tr><td>Road_numbe</td><td>54</td></tr>
    <tr><td>Recommenda</td><td>Not Available.</td></tr>]]></description>
            <styleUrl>#m_ylw-pushpin29</styleUrl>
            <Point>
                <extrude>1</extrude>
                <altitudeMode>relativeToGround</altitudeMode>
                <coordinates>-74.24906300000001,45.998057,0</coordinates>
            </Point>
        </Placemark>
    </Folder>
</Document>

Desired Output:

Name Objectid On_route Road_numbe Recommenda
NODE36 1 Mid Turnpike 54 Continue to monitor.
NODE44 2 Mid Turnpike 54 Not Available.

I tried regex to extract the data between <Placemark> and </Placemark> to no avail;

library(qdapRegex)
my_tbl <- rm_between(file_str, 'Placemark', '/Placemark', extract=TRUE)[[1]]

or

my_tbl <- str_extract_all(file_str, "Placemark((.|\n)*)/Placemark")
Error in stri_extract_all_regex(string, pattern, simplify = simplify,  : 
  Regular expression backtrack stack overflow. (U_REGEX_STACK_OVERFLOW)

I cannot get this one to work in R. Although even if I could, it matches the first occurrence of <Placemark> with the last occurrence of </Placemark>; see here: https://regex101.com/r/bQOdDJ/1


Solution

  • Here's a method with a helper function to turn the HTML table into a data.frame. Basically we need to do a bunch of iterations and parsing of the HTML data.

    library(xml2)
    library(purrr)
    doc <- xml2::read_xml(xx)
    
    table_to_dataframe <- function(x) {
      x |> xml_find_all(".//tr") |> 
        map(function(x) {
          x |> xml_find_all("./td") |> xml_text()
        }) |>
        do.call("rbind", args=_) |>
        (function(x) setNames(x[,2], x[,1]))() |>
        bind_rows()
    }
    
    doc |>
      xml_find_all("//Placemark") |>
      map_df(function(p) {
        name <- p |> xml_find_first("./name") |> xml_text()
        sub <- p |> xml_find_first("./description") |> xml_text() |> read_html()
        bind_cols(tibble(name), table_to_dataframe(sub))
      })
    

    Which returns

      name   Objectid On_route     Road_numbe Recommenda          
      <chr>  <chr>    <chr>        <chr>      <chr>               
    1 NODE36 1        Mid Turnpike 54         Continue to monitor.
    2 NODE44 2        Mid Turnpike 54         Not Available.