I am trying to extract some data from an HTML site. I get 500 nodes, each of which should contain a date, a title, and a summary. Using
library(xml2)
library(magrittr)  # for %>%

url <- "https://www.bild.de/suche.bild.html?type=article&query=Migration&resultsPerPage=1000"
html_raw <- xml2::read_html(url)
main_node <- xml_find_all(html_raw, "//section[@class='query']/ol") %>%
  xml_children()

xml_find_all(main_node, ".//time")                       # date
xml_find_all(main_node, ".//span[@class='headline']")    # title
xml_find_all(main_node, ".//p[@class='entry-content']")  # summary
it returns three node sets with dates, titles, and summaries, which can then be knitted together. At least in theory. Unfortunately my code finds 500 dates and 500 titles but only 499 summaries, because one of the nodes is simply missing.
This leaves me with the problem that I cannot bind these into a data frame because of the difference in length; the summaries would no longer line up with the right dates and titles.
An easy solution would be to loop through the nodes and replace each missing node with a placeholder like "NA".
dates <- c()
titles <- c()
summaries <- c()

for (i in 1:length(main_node)) {
  date_temp <- xml_find_all(main_node[i], ".//time") %>%
    xml_text(trim = TRUE) %>%
    as.Date(format = "%d.%m.%Y")
  title_temp <- xml_find_all(main_node[i], ".//span[@class='headline']") %>%
    xml_text(trim = TRUE)
  summary_temp <- xml_find_all(main_node[i], ".//p[@class='entry-content']") %>%
    xml_text(trim = TRUE)
  # insert a placeholder when the summary node is missing
  if (length(summary_temp) == 0) summary_temp <- "NA"
  dates <- c(dates, date_temp)
  titles <- c(titles, title_temp)
  summaries <- c(summaries, summary_temp)
}
But this makes a simple three-line piece of code unnecessarily long. So my question, I guess, is: is there a more sophisticated approach than a loop?
You could use the purrr library to help and avoid the explicit loop:
library(purrr)
dates <- main_node %>% map_chr(. %>% xml_find_first(".//time") %>% xml_text())
titles <- main_node %>% map_chr(. %>% xml_find_first(".//span[@class='headline']") %>% xml_text())
summaries <- main_node %>% map_chr(. %>% xml_find_first(".//p[@class='entry-content']") %>% xml_text())
This uses the fact that xml_find_first() will return NA if an element is not found, as pointed out by @Dave2e.
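Since the three vectors are now the same length (a missing summary simply becomes NA), they can be combined directly. A minimal sketch, assuming the dates on the page use the dd.mm.yyyy format from your own loop:

# bind the aligned vectors into one data frame; rows with a
# missing summary just carry an NA in that column
articles <- data.frame(
  date    = as.Date(trimws(dates), format = "%d.%m.%Y"),
  title   = trimws(titles),
  summary = trimws(summaries),
  stringsAsFactors = FALSE
)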
But also, in general, growing a vector by appending to it on each iteration of a loop is very inefficient in R. It's better to pre-allocate the vector (since it will be of a known length) and then assign values to the proper slot on each iteration (out[i] <- val). There's not really anything wrong with loops themselves in R; it's really just the repeated memory reallocation that can slow things down.
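As a sketch of what that pre-allocated loop could look like with your XPath expressions (relying on the same xml_find_first()/NA behaviour described above):

n <- length(main_node)
dates     <- character(n)   # pre-allocate to the known length
titles    <- character(n)
summaries <- character(n)

for (i in seq_len(n)) {
  node <- main_node[[i]]
  # xml_find_first() returns a missing node when nothing matches,
  # and xml_text() turns that into NA, so the vectors stay aligned
  dates[i]     <- xml_text(xml_find_first(node, ".//time"), trim = TRUE)
  titles[i]    <- xml_text(xml_find_first(node, ".//span[@class='headline']"), trim = TRUE)
  summaries[i] <- xml_text(xml_find_first(node, ".//p[@class='entry-content']"), trim = TRUE)
}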