I would like to extract key-value pairs from an HTML file (see example below). Unfortunately, there is no HTML node that corresponds to each key-value pair (such as a div element). Instead, all information comes in a single paragraph where keys are highlighted with <strong>.
I would like to present the key-value pairs as two columns of a dataframe or two same-length lists where key 1 corresponds to value 1, key 2 to value 2a and value 2b, and key 3 to value 3. Line breaks are not set consistently in the file.
Since there are no div elements for each pair, I would probably have to come up with a strategy to split the paragraph after each key? I attach a hacky attempt below that treats html as raw text:
library(tidyverse)
library(rvest)
html <- minimal_html(
"<p>
<strong>key 1</strong> value 1
<br></br>
<strong>key 2</strong> value 2a
<br></br>
value 2b
<br></br>
<strong>key 3</strong>
<br></br>
value 3
</p>"
)
# hacky solution treating html as raw text
s <- html |>
  html_elements("p") |>
  as.character()

# parse an html fragment and return its visible text
parse_html <- function(s) {
  s |>
    read_html() |>
    html_text2()
}

s |>
  # mark key boundaries; non-greedy (.*?) so two keys on one line are not merged
  str_replace_all("<strong>(.*?)</strong>", "✂️\\1🔧") |>
  str_split_1("✂️") |>
  map_chr(parse_html) |>
  discard(\(x) str_length(x) == 0L) |>
  str_split("🔧") |>
  map(str_squish)
#> [[1]]
#> [1] "key 1" "value 1"
#>
#> [[2]]
#> [1] "key 2" "value 2a value 2b"
#>
#> [[3]]
#> [1] "key 3" "value 3"
Created on 2024-01-18 with reprex v2.1.0
With XPath we can get a bit further than with CSS selectors. While the example below handles the included HTML snippet, I'm sure there are more robust and elegant XPath strategies to achieve the same goal.
We can start from the values: .//p/strong/following-sibling::text() selects all text nodes (values) that follow strong elements. Whitespace-only strings are fine, as we can filter those out later. While iterating through that node-set, we can then get the closest preceding strong element (the key) for every text node with ./preceding-sibling::strong[1].
library(tidyverse)
library(rvest)
html <- minimal_html(
"<p>
<strong>key 1</strong> value 1
<br></br>
<strong>key 2</strong> value 2a
<br></br>
value 2b
<br></br>
<strong>key 3</strong>
<br></br>
value 3
</p>"
)
# text of the closest <strong> sibling that precedes a value node
preceding_key_text <- function(value_node) {
  html_element(value_node, xpath = "./preceding-sibling::strong[1]") |>
    html_text(trim = TRUE)
}

html |>
  # every text node that follows a <strong> inside <p>
  html_elements(xpath = ".//p/strong/following-sibling::text()") |>
  map(\(value_) list(
    key = preceding_key_text(value_),
    value = html_text(value_, trim = TRUE)
  )) |>
  # drop whitespace-only text nodes
  discard(\(x) x[["value"]] == "") |>
  bind_rows()
Result:
#> # A tibble: 4 × 2
#>   key   value
#>   <chr> <chr>
#> 1 key 1 value 1
#> 2 key 2 value 2a
#> 3 key 2 value 2b
#> 4 key 3 value 3
Created on 2024-01-19 with reprex v2.0.2
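The result above repeats key 2 once per value. If one row per key is preferred, as the question describes, the values could be collapsed afterwards. The following is only a sketch: the tibble is rebuilt by hand so it runs on its own, the names pairs, by_key and collapsed are made up for illustration, and the .by argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Same shape as the result above, rebuilt by hand (illustrative data only)
pairs <- tibble(
  key   = c("key 1", "key 2", "key 2", "key 3"),
  value = c("value 1", "value 2a", "value 2b", "value 3")
)

# Option 1: one row per key, with all values kept in a list column
by_key <- pairs |>
  summarise(value = list(value), .by = key)

# Option 2: one row per key, with values pasted into a single string
collapsed <- pairs |>
  summarise(value = paste(value, collapse = "; "), .by = key)
```

The list column keeps the individual values intact, while the pasted version is easier to print or export. Which one fits better depends on what happens to the data next.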