I have an xml file which I want to extract data from. Ultimately, what I need is a table that shows the name of the node (i.e. NODE36
and NODE44
) with the information in the tables (see the desired output below).
Is there a way to use regex
or an XML parser to extract the data as a table?
<?xml version="1.0" encoding="UTF-8"?>
<Document>
<name>culverts.XML</name>
<StyleMap id="m_ylw-pushpin29">
<Pair>
<key>normal</key>
<styleUrl>#s_ylw-pushpin00</styleUrl>
</Pair>
<Pair>
<key>highlight</key>
<styleUrl>#s_ylw-pushpin_hl25</styleUrl>
</Pair>
</StyleMap>
<Folder>
<name>culverts.XML</name>
<open>1</open>
<description>Culvert</description>
<Placemark>
<name>NODE36</name>
<description><![CDATA[<br><br><br>
<table border="1" padding="0">
<tr><td>Objectid</td><td>1</td></tr>
<tr><td>On_route</td><td>Mid Turnpike</td></tr>
<tr><td>Road_numbe</td><td>54</td></tr>
<tr><td>Recommenda</td><td>Continue to monitor.</td></tr>]]></description>
<styleUrl>#m_ylw-pushpin29</styleUrl>
<Point>
<extrude>1</extrude>
<altitudeMode>relativeToGround</altitudeMode>
<coordinates>-74.249045,45.997986,0</coordinates>
</Point>
</Placemark>
<Placemark>
<name>NODE44</name>
<description><![CDATA[<br><br><br>
<table border="1" padding="0">
<tr><td>Objectid</td><td>2</td></tr>
<tr><td>On_route</td><td>Mid Turnpike</td></tr>
<tr><td>Road_numbe</td><td>54</td></tr>
<tr><td>Recommenda</td><td>Not Available.</td></tr>]]></description>
<styleUrl>#m_ylw-pushpin29</styleUrl>
<Point>
<extrude>1</extrude>
<altitudeMode>relativeToGround</altitudeMode>
<coordinates>-74.24906300000001,45.998057,0</coordinates>
</Point>
</Placemark>
</Folder>
</Document>
Name | Objectid | On_route | Road_numbe | Recommenda |
---|---|---|---|---|
NODE36 | 1 | Mid Turnpike | 54 | Continue to monitor. |
NODE44 | 2 | Mid Turnpike | 54 | Not Available. |
I tried regex
to extract the data between <Placemark>
and </Placemark>
to no avail;
library(qdapRegex)
my_tbl <- rm_between(file_str, 'Placemark', '/Placemark', extract=TRUE)[[1]]
or
my_tbl <- str_extract_all(file_str, "Placemark((.|\n)*)/Placemark")
Error in stri_extract_all_regex(string, pattern, simplify = simplify, :
Regular expression backtrack stack overflow. (U_REGEX_STACK_OVERFLOW)
I cannot get this one to work in R. Although even if I could, it matches the first occurrence of <Placemark>
with the last occurrence of </Placemark>
; see here: https://regex101.com/r/bQOdDJ/1
Here's a method with a helper function to turn the HTML table into a data.frame. Basically we need to do a bunch of iterations and parsing of the HTML data.
library(xml2)
library(purrr)
doc <- xml2::read_xml(xx)
table_to_dataframe <- function(x) {
x |> xml_find_all(".//tr") |>
map(function(x) {
x |> xml_find_all("./td") |> xml_text()
}) |>
do.call("rbind", args=_) |>
(function(x) setNames(x[,2], x[,1]))() |>
bind_rows()
}
doc |>
xml_find_all("//Placemark") |>
map_df(function(p) {
name <- p |> xml_find_first("./name") |> xml_text()
sub <- p |> xml_find_first("./description") |> xml_text() |> read_html()
bind_cols(tibble(name), table_to_dataframe(sub))
})
Which returns
name Objectid On_route Road_numbe Recommenda
<chr> <chr> <chr> <chr> <chr>
1 NODE36 1 Mid Turnpike 54 Continue to monitor.
2 NODE44 2 Mid Turnpike 54 Not Available.