I am new to web scraping and R, and I am trying to scrape the names of all professors of a faculty with the following code:
library(rvest)
library(dplyr)
link = "https://wiso.uni-koeln.de/de/fakultaet/fakultaetsbereiche"
page = read_html(link)
fac_area = page %>% html_nodes("#subnavigation a") %>% html_text()
link_area = page %>% html_nodes("#subnavigation a") %>% html_attr("href") %>% paste("https://wiso.uni-koeln.de/de/fakultaet/fakultaetsbereiche", ., sep= "")
Prof = function(link_areas){
area = read_html(link_area)
chair_prof = area %>% html_nodes (".uzk15__standard_h3") %>%
html_text() %>% paste(collapse = ",")
return(chair_prof)
}
profs = sapply(link_area, FUN = Prof, USE.NAMES = FALSE)
But I get the error:
"x must be a string of length 1"
I don't understand whether this error comes from a mistake in the function or in sapply: the function itself does not give me an error message, and the link_area list is exactly what I would want it to be.
There were two minor errors.
First, your link_area pasted too much into the URLs. The line should have been:
link_area = page %>% html_nodes("#subnavigation a") %>% html_attr("href") %>% paste("https://wiso.uni-koeln.de", ., sep = "")
(Otherwise the constructed URLs led to 404 pages, which is what caused the error message.)
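If you want to verify this yourself, one quick sanity check (a sketch, assuming the httr package is installed) is to look at the HTTP status code of each constructed URL; the broken ones come back as 404 while the fixed ones return 200:

```r
library(httr)

# Print the HTTP status of each scraped URL: 200 means OK, 404 means not found.
# GET() and status_code() are from the httr package.
for (url in link_area) {
  print(paste(status_code(GET(url)), url))
}
```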
Second, the first line of the function Prof had a typo: it read link_area, although the function's argument is called link_areas.
So, the full code should be:
library(rvest)
library(dplyr)
link = "https://wiso.uni-koeln.de/de/fakultaet/fakultaetsbereiche"
page = read_html(link)
fac_area = page %>% html_nodes("#subnavigation a") %>% html_text()
link_area = page %>% html_nodes("#subnavigation a") %>% html_attr("href") %>% paste("https://wiso.uni-koeln.de", ., sep= "")
# ^^^ note the shortened URL in the paste()-function
Prof = function(link_areas){
area = read_html(link_areas) # <---- note there was a typo here
chair_prof = area %>% html_nodes(".uzk15__standard_h3") %>%
html_text() %>% paste(collapse = ",")
return(chair_prof)
}
profs = sapply(link_area, FUN = Prof, USE.NAMES = FALSE)
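Since fac_area and profs line up element by element, you can optionally pair them in a data frame afterwards (a small sketch, assuming both scrapes return the same number of elements):

```r
# Pair each faculty-area name with its comma-separated list of professors.
result = data.frame(area = fac_area, professors = profs, stringsAsFactors = FALSE)
head(result)
```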