I'd like to scrape a table (containing information about 31,385 soldiers) from https://irelandsgreatwardead.ie/the-archive/ using rvest.
library(rvest)
library(dplyr)
page <- read_html(x = "https://irelandsgreatwardead.ie/the-archive/")
table <- page %>%
  html_nodes("table") %>%
  html_table(fill = TRUE) %>%
  as.data.frame()
This works, but only for the first 10 soldiers. In the page source I can only see the information for those first 10 soldiers as well. Any help on how to obtain the rows for the other soldiers would be highly appreciated!
Thanks and have a great day!
Here is an RSelenium solution.
You can loop through the pages, extracting the table on each one and appending it to the rows collected so far.
First, launch the browser and navigate to the archive page:
library(RSelenium)

driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]

# The page to scrape; `url` must be defined before navigating
url <- "https://irelandsgreatwardead.ie/the-archive/"
remDr$navigate(url)
PART 1: Extract the table from the first page and store it in df.
df <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_table()
df <- df[[1]]

# Remove the last row, which is non-essential
df <- df[-nrow(df), ]
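A quick sanity check at this stage helps confirm the table came through as expected (the exact column names and the number of rows per page depend on what the site serves):
# Inspect the first page's table
dim(df)   # rows on the first page x number of columns
head(df)  # first few soldier records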
PART 2: Loop through pages 2 to 5
for (i in 2:5) {
  # Build the xpath of the page-number link for page i
  xp <- paste0('//*[@id="table_1_paginate"]/span/a[', i, ']')
  cc <- remDr$findElement(using = 'xpath', value = xp)
  cc$clickElement()

  # Give the page three seconds to load
  Sys.sleep(3)

  df1 <- remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_table()
  df1 <- df1[[1]]
  df1 <- df1[-nrow(df1), ]

  # Append the current page's table `df1` to the accumulated table `df`
  df <- rbind(df, df1)
}
PART 3: Loop through the remaining pages, 6 to 628.
From page 6 onward, the xpath of the link that moves to the next page stays the same, so we repeat the same block 623 more times to collect the tables from the remaining pages.
for (i in 1:623) {
  # From page 6 onward this xpath always points at the next page's link
  cc <- remDr$findElement(using = 'xpath', value = '//*[@id="table_1_paginate"]/span/a[4]')
  cc$clickElement()
  Sys.sleep(3)

  df1 <- remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_table()
  df1 <- df1[[1]]
  df1 <- df1[-nrow(df1), ]
  df <- rbind(df, df1)
}
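As a side note, if you were starting from a freshly loaded archive page, Parts 1 to 3 could be collapsed into a single loop that clicks the table's "Next" control instead of page-number links. This is only a sketch: it reuses the remDr session and the rvest functions loaded above, and it assumes the standard DataTables naming convention in which the "Next" button has id table_1_next, which I have not verified against this page.
# Hypothetical single-loop variant: read page 1, then click "Next" 627 times
df_all <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_table()
df_all <- df_all[[1]]
df_all <- df_all[-nrow(df_all), ]
for (i in 1:627) {
  # Assumed id of the DataTables "Next" button (not verified)
  nxt <- remDr$findElement(using = 'id', value = 'table_1_next')
  nxt$clickElement()
  Sys.sleep(3)
  tmp <- remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_table()
  tmp <- tmp[[1]]
  tmp <- tmp[-nrow(tmp), ]
  df_all <- rbind(df_all, tmp)
}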
Now we have df with the information for all 31,385 soldiers.
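Once the loop finishes, you may want to write df to disk and shut down the Selenium session (the file name below is arbitrary):
# Save the scraped records and clean up the browser session
write.csv(df, "irelands_great_war_dead.csv", row.names = FALSE)
remDr$close()
driver[["server"]]$stop()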