
Web Scraping on multiple pages with RSelenium and select emails with regular expression


I would like to collect email addresses by clicking each name on this website: https://ki.se/en/research/professors-at-ki. I wrote the loop below, but for some reason some emails are not collected, and the code is very slow. Do you have a better approach? Thank you very much in advance.

library(RSelenium)
library(stringr)

# use RSelenium to download emails
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD[["client"]]
remDr$navigate("https://ki.se/en/research/professors-at-ki")

# count the professor names once, then preallocate the result table
elems <- remDr$findElements(using = "xpath", "//strong")
n <- length(elems)
database <- data.frame(name = rep(NA_character_, n),
                       email = NA_character_,
                       university = NA_character_)

for (i in 1:n) {
  # reload the list page, then click the i-th name
  remDr$navigate("https://ki.se/en/research/professors-at-ki")
  elems <- remDr$findElements(using = "xpath", "//strong")
  elem <- elems[[i]]
  people <- elem$getElementText()
  elem$clickElement()
  page <- remDr$getPageSource()

  # split the page source into lines and keep those containing "@"
  p <- str_split(as.character(page), "\n")
  a <- grep("@", p[[1]])

  if (length(a) > 0) {
    email <- p[[1]][a[2]]
    email <- gsub(" ", "", email)
    database[i, 1] <- people
    database[i, 2] <- email
    database[i, 3] <- "Karolinska Institute"
  }
}



Solution

  • RSelenium is usually not the fastest approach, as it requires a browser to load each page. There are cases where RSelenium is the only option, but here you can achieve what you need with the rvest library, which should be faster. As for the missing addresses: there are two professors whose links do not seem to work, which is why those e-mails are not collected.

    library(rvest)
    library(tidyverse)
    
    # getting links to professors microsites as part of the KI main website
    r <- read_html("https://ki.se/en/research/professors-at-ki")
    
    people_links <- r %>%
      html_nodes("a") %>%
      html_attrs() %>%
      as.character() %>%
      str_subset("https://staff.ki.se/people/")
    
    # accessing the obtained links, getting the e-mails
    df <- tibble(people_links) %>%
      # filtering out these links as they do not seem to be accessible
      filter( !(people_links %in% c("https://staff.ki.se/people/gungra", "https://staff.ki.se/people/evryla")) ) %>%
      rowwise() %>%
      mutate(
        mail = read_html(people_links) %>%
          html_nodes("a") %>%
          html_attrs() %>%
          as.character() %>%
          str_subset("mailto:") %>%
          str_remove("mailto:")
      )