I would like to collect the email addresses behind each name on this website: https://ki.se/en/research/professors-at-ki. I wrote the following loop, which clicks each name in turn. For some reason some emails are not collected, and the code is very slow. Is there a better way to do this? Thank you very much in advance.
library(RSelenium)
library(stringr) # provides str_split(), used below
# use RSelenium to download emails
rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]
remDr$navigate("https://ki.se/en/research/professors-at-ki")
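# `name` is not defined in the original post; assumed here to be the list of name elements on the page
name <- remDr$findElements(using = 'xpath', "//strong")
database <- data.frame(matrix(NA, nrow = length(name), ncol = 3))
for(i in seq_along(name)){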
  # reload the listing page on every iteration
  remDr$navigate("https://ki.se/en/research/professors-at-ki")
  elems <- remDr$findElements(using = 'xpath', "//strong") # all elements to be selected
  elem <- elems[[i]] # search for and click on each one
  class(elem)
  people <- elem$getElementText()
  elem$clickElement()
  page <- remDr$getPageSource()
  # split the page source into lines
  p <- str_split(as.character(page), "\n")
  a <- grep("@", p[[1]])
  if(length(a) > 0){
    # take the second line that contains "@"
    email <- p[[1]][a[2]]
    email <- gsub(" ", "", email)
    database[i,1] <- people
    database[i,2] <- email
    database[i,3] <- "Karolinska Institute"
  }
}
RSelenium is usually not the fastest approach, because it requires a browser to load and render every page. There are cases where RSelenium is the only option, but here you can get what you need with the rvest library, which should be faster. As for the missing emails: the links for two of the professors do not seem to work, which explains the errors you see.
library(rvest)
library(tidyverse)
# getting links to the professors' microsites from the KI main website
r <- read_html("https://ki.se/en/research/professors-at-ki")
people_links <- r %>%
  html_nodes("a") %>%
  html_attr("href") %>% # the href of each link (NA where absent)
  str_subset("https://staff.ki.se/people/")
# accessing the obtained links, getting the e-mails
df <- tibble(people_links) %>%
  # filtering out these links as they do not seem to be accessible
  filter( !(people_links %in% c("https://staff.ki.se/people/gungra", "https://staff.ki.se/people/evryla")) ) %>%
  rowwise() %>%
  mutate(
    mail = read_html(people_links) %>%
      html_nodes("a") %>%
      html_attr("href") %>%
      str_subset("mailto:") %>%
      str_remove("mailto:")
  )
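If you would rather not hard-code the inaccessible links, one option is to wrap the page request so that a failing link yields NA instead of stopping the whole run. The following is only a sketch of that idea, not a tested replacement for the code above: extract_emails() and get_email() are hypothetical helper names introduced here, the example address is made up, and it assumes that the first mailto link on a profile page is the one you want.

```r
library(rvest)
library(purrr)
library(stringr)

# Hypothetical helper: pull e-mail addresses out of an already parsed page.
extract_emails <- function(page) {
  page %>%
    html_nodes("a") %>%
    html_attr("href") %>%       # NA for links without an href
    str_subset("^mailto:") %>%  # keeps only mailto links, drops NA
    str_remove("^mailto:") %>%
    unique()
}

# Hypothetical helper: fetch one profile page and return its first e-mail,
# or NA when the page cannot be read or has no mailto link.
get_email <- function(url) {
  safe_read <- possibly(read_html, otherwise = NULL)
  page <- safe_read(url)
  if (is.null(page)) return(NA_character_)
  emails <- extract_emails(page)
  if (length(emails) == 0) NA_character_ else emails[[1]]
}

# Offline check on an inline HTML snippet (the address is made up)
demo <- read_html('<p><a href="mailto:jane.doe@example.org">Mail</a> <a href="/x">x</a></p>')
extract_emails(demo)

# Possible use with the links collected earlier (needs network access):
# df <- tibble::tibble(people_links) %>%
#   dplyr::mutate(mail = map_chr(people_links, get_email))
```

Rows that come back as NA then point you to the profile pages that could not be read, so you can inspect those links by hand.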