r, rvest, readxml

lapply and read_xml.character


I am trying to extract data from a website using a custom function:

library(tidyverse)
library(rvest)
url = "https://www.boerse.de/fundamental-analyse/garbage/" # last part does not change outcome, therefore 'garbage'
read_html_tables = function(ISIN){
  content <- read_html(paste0(url,ISIN,"#guv")) %>%
    html_table(dec = ",") %>%
    .[c(5:10)]
  return(content)
}

If I run this function with a single ISIN, e.g. US88579Y1010, I get the desired result: a list containing 6 tibbles with the data I want. But if I wrap this function in lapply() with a vector containing a few hundred ISINs, I get the following error:

list_of_all <- lapply(X = df[,2], FUN = read_html_tables)

Error: x must be a string of length 1
Called from: read_xml.character(x, encoding = encoding, ..., as_html = TRUE, options = options)

If I call which(length(df[,2]) != 1) (the column where the ISINs are), I get integer(0), so there seems to be no issue with the ISIN column in this dataframe. And since it works with a single ISIN as input, the read_html(paste0(url,ISIN)) part seems to work as well.

I have used a very similar function before and wrapped it in lapply(). The earlier function did essentially the same thing, but had to do some searching and combining to build the correct URL to pass into the read_html(paste0(url,ISIN)) part (on another website). I am a bit puzzled, since this error never occurred before. Now that it has occurred, running the earlier function produces the same error as well, which it never did previously.

Maybe there is a more talented R programmer out there who can spot the issue?

Edit: Since a reply suggested the ISIN list is the issue: the first two are US88579Y1010 and US8318652091. Passing each one individually into the function works, and so does combining them in a vector (c(ISIN1, ISIN2)) and passing that vector to lapply(). But if I point at both ISINs inside the tibble (df[1:2,2]), I get the error from above. What am I missing here?


Solution

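  • read_xml.character, called by read_html(), does not accept a column selected from a tibble with df[,2] as valid input. On a tibble, df[,2] returns a one-column tibble rather than a character vector, so lapply() hands the whole ISIN column to the function in a single call and paste0() produces more than one URL. Converting the tibble to a data.frame and rerunning gives the desired output; extracting the column as a plain vector with df[[2]] should work as well.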
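
The difference between the two containers can be checked without touching the network. The following is a minimal sketch, not the asker's actual data: the tibble `df`, its `id` column, and the helper `count_inputs()` are hypothetical stand-ins, and `count_inputs()` only reports how many strings it receives per call, which is the condition read_html() complains about.

```r
library(tibble)

# Hypothetical stand-in for the asker's df: the ISINs sit in the second column.
df <- tibble(
  id   = 1:2,
  ISIN = c("US88579Y1010", "US8318652091")
)

# Network-free stand-in for read_html_tables(): returns the number of
# strings received in one call. read_html() needs this to be exactly 1.
count_inputs <- function(ISIN) length(ISIN)

# On a tibble, df[, 2] is still a tibble with one column, so its length is
# the number of columns (1), not the number of ISINs.
n_cols <- length(df[, 2])

# lapply() over that one-column tibble iterates over columns and passes the
# entire ISIN vector in a single call.
from_tibble <- lapply(X = df[, 2], FUN = count_inputs)

# df[[2]] extracts a plain character vector, so lapply() makes one call per ISIN.
from_vector <- lapply(X = df[[2]], FUN = count_inputs)

# On a base data.frame, df[, 2] drops to a vector by default, which is why
# converting the tibble made the original code work.
from_df <- lapply(X = as.data.frame(df)[, 2], FUN = count_inputs)

str(from_tibble)
str(from_vector)
str(from_df)
```

This also suggests why the which(length(df[,2]) != 1) check in the question came back empty: on a tibble, length(df[,2]) counts columns, so it is 1 regardless of how many ISINs the column holds.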