I am trying to extract data from a website using a custom function:
library(tidyverse)
library(rvest)
url = "https://www.boerse.de/fundamental-analyse/garbage/" # last part does not change outcome, therefore 'garbage'
read_html_tables = function(ISIN){
  content <- read_html(paste0(url,ISIN,"#guv")) %>%
    html_table(dec = ",") %>%
    .[c(5:10)]
  return(content)
}
If I run this function with a given ISIN, e.g. US88579Y1010, I get the desired result: a list containing 6 tibbles with the data I want. But if I wrap this function in lapply() with a vector containing a few hundred ISINs, I get the following error:
list_of_all <- lapply(X = df[,2], FUN = read_html_tables)
Error: x must be a string of length 1
Called from: read_xml.character(x, encoding = encoding, ..., as_html = TRUE,
options = options)
If I call which(length(df[,2]) != 1) (the column where the ISINs are), I get integer(0), so there seems to be no issue with the ISIN column in this dataframe. And since it works with a single ISIN as input, the read_html(paste0(url,ISIN)) part seems to work as well.
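A note on that check: as far as I can tell it cannot flag anything when df is a tibble, because length() of df[,2] counts columns rather than looking at the individual ISINs. The sketch below uses a small made-up tibble (the column names and the third, deliberately invalid, value are assumptions for illustration) and contrasts it with a per-element check based on the 12-character ISIN length.

```r
library(tibble)

# Hypothetical three-row tibble standing in for the real df.
df <- tibble(
  id   = 1:3,
  isin = c("US88579Y1010", "US8318652091", "not an ISIN")
)

# length() of a one-column tibble counts columns, not rows or characters,
# so this comparison is FALSE no matter what the column contains.
n_cols <- length(df[, 2])
which(n_cols != 1)

# A per-element check on the extracted character vector instead:
# flag entries that are missing or not exactly 12 characters long.
bad_rows <- which(is.na(df[[2]]) | nchar(df[[2]]) != 12)
bad_rows
```

Here the first check comes back empty even though the third row is clearly not an ISIN, while the second check points at that row.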
I have used a very similar function before and wrapped it in lapply(). The earlier function did basically the same thing as this one, but had to do some searching and combining to build the correct URL to pass into the read_html(paste0(url,ISIN)) part (on another website).
I am a bit puzzled, since this error did not occur before. But now that it has occurred, running the earlier function gives the same error (which I never received before).
Maybe there is a more talented R programmer out there who can spot the issue?
Edit: Since a reply suggested the ISIN-list is the issue:
The first two are US88579Y1010 and US8318652091. Passing each of them individually into the function works, and so does putting them in a vector (c(ISIN1, ISIN2)) and passing that vector to lapply. But if I point at both ISINs inside the tibble (df[1:2,2]) I get the error from above. What am I missing here?
Solution:
read_xml.character (called from read_html()) does not seem to accept a column taken from a tibble as valid input. Converting the tibble to a data.frame and rerunning gives the desired output.
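My understanding of why this happens: on a tibble, df[,2] stays a one-column tibble instead of dropping to a character vector, so lapply() iterates over its columns and hands the whole ISIN vector to the function in a single call. paste0() then produces a vector of URLs, which read_html() rejects. A minimal sketch without any network access follows; build_url is a hypothetical stand-in for read_html_tables that only builds the URL, and the column names are made up.

```r
library(tibble)

# Hypothetical stand-in for the real data; the ISINs are the two from the question.
df <- tibble(
  name = c("first", "second"),
  isin = c("US88579Y1010", "US8318652091")
)

url <- "https://www.boerse.de/fundamental-analyse/garbage/"

# Stand-in for read_html_tables(): builds the URL only, no request is made.
build_url <- function(ISIN) paste0(url, ISIN, "#guv")

# Tibble: df[, 2] is still a one-column tibble, so lapply() loops over
# columns and passes the entire ISIN vector in one call.
from_tibble <- lapply(df[, 2], build_url)
length(from_tibble)       # number of columns, not number of ISINs
length(from_tibble[[1]])  # several URLs at once, which read_html() rejects

# Extracting the column as a plain character vector gives one call per ISIN.
from_vector <- lapply(df[[2]], build_url)
lengths(from_vector)

# Converting to a data.frame has the same effect, since [, 2] then drops
# to a vector by default.
from_df <- lapply(as.data.frame(df)[, 2], build_url)
```

So instead of converting the whole tibble, lapply(df[[2]], read_html_tables) or dplyr::pull(df, 2) should also work, though I have only checked the URL-building part here.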