r  web-scraping  html-table  rvest  rselenium

Scraping a web table through multiple pages (some rows are missing)


I'd like to scrape a table (containing information about 31,385 soldiers) from https://irelandsgreatwardead.ie/the-archive/ using rvest.

library(rvest)
library(dplyr)

page <- read_html(x = "https://irelandsgreatwardead.ie/the-archive/")    
table <- page             %>% 
  html_nodes("table")     %>%  
  html_table(fill = TRUE) %>%
  as.data.frame()

This works, but only for the first 10 soldiers. The page source also contains only the first 10 soldiers. Any help on how to obtain the rows for the other soldiers would be highly appreciated!

Thanks and have a great day!


Solution

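  • Here is an RSelenium solution.

    You can loop through the pages, extracting the table from each one and appending it to the rows collected so far.

    First, launch the browser:

      library(RSelenium)
      library(rvest)  # read_html(), html_table(), and the %>% pipe

      url <- "https://irelandsgreatwardead.ie/the-archive/"
      # Start a Selenium server and open a Firefox session
      driver <- rsDriver(browser = "firefox")
      remDr <- driver[["client"]]
      remDr$navigate(url)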
    

    PART 1: Extract the table from the first page and store it in df.

    df = remDr$getPageSource()[[1]] %>%
      read_html() %>%
      html_table()
    df = df[[1]]
    # Remove the last row, which is non-essential
    df = df[-nrow(df),]
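
The extraction step in PART 1 is repeated in every loop further down, so it could be wrapped in a small helper. The following is a minimal sketch rather than part of the original answer: `extract_page_table` is a hypothetical name, and the inline HTML is made-up demo data standing in for `remDr$getPageSource()[[1]]`.

```r
library(rvest)  # read_html(), html_table()

# Hypothetical helper: parse the page source, take the first table,
# and drop its last row (treated as non-essential, as in PART 1).
extract_page_table <- function(page_source) {
  tables <- html_table(read_html(page_source))
  tbl <- tables[[1]]
  tbl[-nrow(tbl), ]
}

# Made-up HTML standing in for remDr$getPageSource()[[1]]
demo_html <- paste0(
  "<html><body><table>",
  "<tr><th>Surname</th><th>Rank</th></tr>",
  "<tr><td>Doe</td><td>Private</td></tr>",
  "<tr><td>Roe</td><td>Sergeant</td></tr>",
  "<tr><td>Surname</td><td>Rank</td></tr>",
  "</table></body></html>"
)

demo <- extract_page_table(demo_html)
print(demo)
```

With a live session, the call would be `extract_page_table(remDr$getPageSource()[[1]])`.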
    

    PART 2: Loop through pages 2 to 5.

    for (i in 2:5) {
      # Build the XPath of the pagination link for page i
      xp = paste0('//*[@id="table_1_paginate"]/span/a[', i, ']')
      cc <- remDr$findElement(using = 'xpath', value = xp)
      cc$clickElement()

      # Pause for three seconds to let the page load
      Sys.sleep(3)
      df1 = remDr$getPageSource()[[1]] %>%
        read_html() %>%
        html_table()
      df1 = df1[[1]]
      df1 = df1[-nrow(df1),]

      # Append the current table `df1` to the accumulated table `df`
      df = rbind(df, df1)
    }
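
A fixed `Sys.sleep(3)` can be too short on a slow connection and wasteful on a fast one. One alternative, sketched here under assumptions, is to poll until some condition holds or a timeout expires. `wait_until` is a hypothetical base R helper, not an RSelenium function, and the condition you pass (for example, that the first row's text differs from the previous page's) is something you would have to write for this site.

```r
# Hypothetical helper: call `condition` repeatedly until it returns TRUE
# or `timeout` seconds have passed. Returns TRUE on success, FALSE on timeout.
wait_until <- function(condition, timeout = 10, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(condition())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Demo with a stand-in condition that becomes TRUE on the third check
checks <- 0
ready <- wait_until(function() {
  checks <<- checks + 1
  checks >= 3
}, timeout = 5, interval = 0.05)
print(ready)
```

In the scraping loop, a call to such a helper would replace `Sys.sleep(3)`, with a FALSE result signalling that the page did not change in time.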
    

    PART 3: Loop through the remaining pages, 6 to 628.

    From page 6 onward, the XPath to click stays the same, so the same block is repeated 623 times (once for each of pages 6 to 628) to collect the remaining tables.

    for (i in 1:623) {
      # The link to click is always the fourth one in the pagination bar
      cc <- remDr$findElement(using = 'xpath', value = '//*[@id="table_1_paginate"]/span/a[4]')
      cc$clickElement()
      Sys.sleep(3)
      df1 = remDr$getPageSource()[[1]] %>%
        read_html() %>%
        html_table()
      df1 = df1[[1]]
      df1 = df1[-nrow(df1),]
      df = rbind(df, df1)
    }
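
Calling `rbind(df, df1)` inside the loop copies the growing table on every pass. A common alternative is to store each page's table in a list and bind once at the end. The sketch below uses made-up data frames in place of the scraped pages; `fake_page` is a hypothetical stand-in for the extraction step.

```r
# Hypothetical stand-in for "scrape page i": returns a small data frame
fake_page <- function(i, rows_per_page = 3) {
  data.frame(
    page = i,
    row  = seq_len(rows_per_page)
  )
}

n_pages <- 4
pages <- vector("list", n_pages)  # pre-allocate one slot per page

for (i in seq_len(n_pages)) {
  pages[[i]] <- fake_page(i)
}

# Bind all pages in a single call
all_rows <- do.call(rbind, pages)
print(dim(all_rows))
```

With the real scraper, `pages[[i]]` would hold the table extracted after each click. `dplyr::bind_rows(pages)` is an alternative to `do.call(rbind, pages)`.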
    

    df now holds the information for all soldiers.
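
If the page source is read before the table has finished updating, the same page may be captured twice and another one missed, so it may be worth checking the result before using it. The sketch below is only an illustration: `check_scrape` is a hypothetical helper, the demo data is made up, and the expected count of 31,385 comes from the question rather than from inspecting the site.

```r
# Hypothetical helper: compare the scraped table against an expected row
# count and count rows that are exact duplicates of an earlier row.
check_scrape <- function(df, expected_rows) {
  list(
    n_rows       = nrow(df),
    n_missing    = expected_rows - nrow(df),
    n_duplicated = sum(duplicated(df))
  )
}

# Made-up data: 5 rows, one of them an exact duplicate
demo <- data.frame(
  surname = c("Doe", "Roe", "Doe", "Poe", "Moe"),
  number  = c(1, 2, 1, 4, 5)
)
result <- check_scrape(demo, expected_rows = 6)
str(result)
```

For the real table the call would be `check_scrape(df, expected_rows = 31385)`.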