Tags: r, web-scraping, rvest, sapply

Problem with web scraping using rvest and sapply: "`x` must be a string of length 1"


I am new to web scraping and R, and I am trying to scrape the names of all professors of a faculty with the following code:

library(rvest)
library(dplyr)

link = "https://wiso.uni-koeln.de/de/fakultaet/fakultaetsbereiche"
page = read_html(link)

fac_area = page %>% html_nodes("#subnavigation a") %>% html_text()
link_area = page %>% html_nodes("#subnavigation a") %>% html_attr("href") %>% paste("https://wiso.uni-koeln.de/de/fakultaet/fakultaetsbereiche", ., sep= "")

Prof = function(link_areas){
  area = read_html(link_area)
  chair_prof = area %>% html_nodes(".uzk15__standard_h3") %>%
    html_text() %>% paste(collapse = ",")
  return(chair_prof)
}

profs = sapply(link_area, FUN = Prof, USE.NAMES = FALSE) 

But I get the error:

"`x` must be a string of length 1"

I don't understand whether this error comes from a mistake in the function or in sapply; the function itself does not give me an error message, and the link_area vector is exactly what I want it to be.


Solution

  • There were two minor errors.

    First, your link_area pasted too much into the URLs. The line should have been:

    link_area = page %>% html_nodes("#subnavigation a") %>% html_attr("href") %>% paste("https://wiso.uni-koeln.de", ., sep= "")
    

    (With the typo below fixed, those over-long URLs would still fail, just with an HTTP 404 error rather than this message.)

    Secondly, the first line of the function Prof had a typo: it read from link_area instead of the argument link_areas. Because of that, read_html() received the entire link_area vector from the global environment, and read_html() only accepts a single string, which is exactly what produced the "`x` must be a string of length 1" message.

    So, the full code should be:

    library(rvest)
    library(dplyr)
    
    link = "https://wiso.uni-koeln.de/de/fakultaet/fakultaetsbereiche"
    page = read_html(link)
    
    fac_area = page %>% html_nodes("#subnavigation a") %>% html_text()
    link_area = page %>% html_nodes("#subnavigation a") %>% html_attr("href") %>% paste("https://wiso.uni-koeln.de", ., sep= "")
    # ^^^ note the shortened URL in the paste()-function
    
    Prof = function(link_areas){
      area = read_html(link_areas) # <---- note there was a typo here
      chair_prof = area %>% html_nodes(".uzk15__standard_h3") %>%
        html_text() %>% paste(collapse = ",")
      return(chair_prof)
    }
    
    profs = sapply(link_area, FUN = Prof, USE.NAMES = FALSE)
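
  • As a side note, instead of pasting the base URL onto the href values by hand, you can let xml2 (the package rvest is built on) resolve relative links for you. A minimal sketch, assuming the same #subnavigation selector still matches the page:

    library(rvest)
    library(xml2)
    
    link = "https://wiso.uni-koeln.de/de/fakultaet/fakultaetsbereiche"
    page = read_html(link)
    
    # url_absolute() resolves each relative href against the page URL,
    # so the prefix cannot accidentally end up too long again
    hrefs = page %>% html_nodes("#subnavigation a") %>% html_attr("href")
    link_area = url_absolute(hrefs, link)

    This also keeps working if the site ever returns absolute hrefs, because url_absolute() leaves already-absolute URLs unchanged.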