
xml2 package: How to treat empty nodes?


I am trying to extract some data from an HTML site. I have 500 nodes, each of which should contain a date, a title and a summary. Using

library(xml2)
library(magrittr)  # provides the %>% pipe used below

url <- "https://www.bild.de/suche.bild.html?type=article&query=Migration&resultsPerPage=1000"
html_raw <- read_html(url)
main_node <- xml_find_all(html_raw, "//section[@class='query']/ol") %>%
  xml_children()

xml_find_all(main_node, ".//time") #time
xml_find_all(main_node, ".//span[@class='headline']") #title
xml_find_all(main_node, ".//p[@class='entry-content']") #summary

it returns three vectors with dates, titles and summaries, which can then be knitted together. At least in theory. Unfortunately, my code finds 500 dates and 500 titles, but only 499 summaries. The reason is that one of the summary nodes is simply missing.

This leaves me with the problem that I cannot bind the vectors into a data frame because of the difference in length: the summaries would no longer line up with the correct dates and titles.
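
For illustration, binding them directly fails (a minimal sketch, assuming the three vectors are named dates, titles and summaries as in my loop below):

data.frame(dates, titles, summaries)
# Error in data.frame(dates, titles, summaries) :
#   arguments imply differing number of rows: 500, 499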

An easy solution would be to loop through the nodes and replace each empty node with a placeholder such as NA.

dates <- c()
titles <- c()
summaries <- c()

for(i in seq_along(main_node)){
  date_temp <- xml_find_all(main_node[i], ".//time") %>%
    xml_text(trim = TRUE) %>%
    as.Date(format = "%d.%m.%Y")
  title_temp <- xml_find_all(main_node[i], ".//span[@class='headline']") %>%
    xml_text(trim = TRUE)
  summary_temp <- xml_find_all(main_node[i], ".//p[@class='entry-content']") %>%
    xml_text(trim = TRUE)

  if(length(summary_temp) == 0) summary_temp <- NA_character_  # a real NA, not the string "NA"

  dates <- c(dates, date_temp)
  titles <- c(titles, title_temp)
  summaries <- c(summaries, summary_temp)
}

But this turns three lines of simple code into something unnecessarily long. So my question, I guess, is: is there a more sophisticated approach than a loop?


Solution

  • You could use the purrr library to help avoid the explicit loop:

    library(purrr)
    dates <- main_node %>% map_chr(. %>% xml_find_first(".//time") %>% xml_text())
    titles <- main_node %>% map_chr(. %>% xml_find_first(".//span[@class='headline']") %>% xml_text())
    summaries <- main_node %>% map_chr(. %>% xml_find_first(".//p[@class='entry-content']") %>% xml_text())
    

    This uses the fact that xml_find_first() returns NA if an element is not found, as pointed out by @Dave2e.
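
    With all three vectors now the same length (NA filling in for the missing summary), they can be combined into one data frame. A minimal sketch, assuming the dates use the %d.%m.%Y format from the question:

    articles <- data.frame(
      date    = as.Date(dates, format = "%d.%m.%Y"),  # format assumed from the question
      title   = titles,
      summary = summaries,
      stringsAsFactors = FALSE
    )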

    But also, in general, growing a vector by appending to it in each iteration of a loop is very inefficient in R. It's better to pre-allocate the vector (since its length is known) and then assign each value into its slot (out[i] <- val), as sketched below. There's nothing wrong with loops themselves in R; it's the repeated memory reallocation that slows things down.
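
    For example, the question's loop could pre-allocate its output instead of growing it with c(). A sketch of the pattern for the summaries only (the other columns work the same way):

    n <- length(main_node)
    summaries <- character(n)  # allocate the full-length vector up front
    for (i in seq_len(n)) {
      # xml_find_first() returns a missing node when nothing matches, and
      # xml_text() turns that into NA, so the absent summary needs no special case
      summaries[i] <- xml_text(xml_find_first(main_node[[i]], ".//p[@class='entry-content']"), trim = TRUE)
    }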