
Web scraping with rvest for wrapped (paginated) tables


I have a similar problem to this one. I want to download the tables for all years/months in this webpage. I have been able to download the tables that appear when opening the website using the following code:

library(rvest)

#######
# Pages
#######
yr.list <- seq(2012,2020)
mes.list <- c("Enero", "Febrero", "Marzo", "Abril", "Mayo", "Junio", "Julio", "Agosto", "Septiembre", "Octubre", "Noviembre", "Diciembre")
c.list <- c("contrata","planta")

################################################
## UTarapaca Scraping Loop PLANTA & CONTRATA
################################################

combined_df <- data.frame()
for (c in c.list) {
  for (yr in yr.list) {
    for (mes in mes.list) {
      # UTarapaca root URL
      root <- "https://www.uta.cl/transparencia/"

      # Full link
      url <- paste(root, c, "/", yr, "/", mes, "/", sep = "")

      # Parse the HTML file
      file <- read_html(url)

      # Get the nodes where the tables live
      tables <- html_nodes(file, "table")

      # This is the relevant table
      table <- as.data.frame(html_table(tables[1], fill = TRUE))

      # Append this page's rows to the combined data frame
      combined_df <- rbind(combined_df, table)
    }
  }
}

However, that code only fetches the 10 records shown on the first page of each table ("Registros por pagina" = 10 in the upper right corner of the table), whereas I want to download all the records that the wrapped table contains. I tried looping over the different table pages (the page selector is in the lower right corner of the table), but the URL does not change when I switch pages.

Any help on this would be greatly appreciated. Best, Maria


Solution

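  • Here is a way with rvest. First create all links outside any loop. Then use lapply with an anonymous function to read each page and extract the tables from it. The child nodes of each table are parsed separately: the first supplies the column names and the second holds the data rows, so the names of the first are assigned to the second. Note that the \(x) lambda syntax requires R 4.1 or later.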

    library(httr)
    library(rvest)
    library(dplyr)
    
    root <- "https://www.uta.cl/transparencia/"
    c.list <- c("contrata","planta")
    yr.list <- seq(2012, 2020)
    mes.list <- c("Enero", "Febrero", "Marzo", "Abril", "Mayo", "Junio", "Julio", "Agosto", "Septiembre", "Octubre", "Noviembre", "Diciembre")
    
    df_links <- expand.grid(c.list, yr.list, mes.list)
    head(df_links)
    
    links <- with(df_links, sprintf("%s%s/%s/%s", root, Var1, Var2, Var3))
    length(links)
    
    tables_list <- lapply(links, \(x) {
      page <- read_html(x)
      tbl_list <- page %>%
        html_elements("table") %>%
        html_children() %>%
        html_table()
      names(tbl_list[[2]]) <- names(tbl_list[[1]])
      tbl_list[[2]]
    })
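
The call above returns a list with one table per link rather than the single `combined_df` the question was building. A minimal sketch of the combining step in base R follows. The toy data frames and their column names (`nombre`, `cargo`) are hypothetical stand-ins for the scraped tables, and the sketch assumes every table has the same columns.

```r
# Toy stand-ins for two scraped tables; the column names are hypothetical.
tables_list <- list(
  data.frame(nombre = c("A", "B"), cargo = c("x", "y")),
  data.frame(nombre = "C", cargo = "z")
)

# Stack all tables into one data frame.
# rbind() needs identical column names in every element.
combined_df <- do.call(rbind, tables_list)

nrow(combined_df)
```

If the scraped tables turn out to have differing columns, `dplyr::bind_rows(tables_list)` is a more forgiving alternative, since it fills the missing columns with `NA`.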
    

    Edit

    To add a column identifying the c/mes/year combination each row came from, use the following lapply loop.

    tables_list <- lapply(seq_along(links), \(i) {
      x <- links[i]
      id <- with(df_links, sprintf("%s/%s/%s", Var1[i], Var3[i], Var2[i]))
      page <- read_html(x)
      tbl_list <- page %>%
        html_elements("table") %>%
        html_children() %>%
        html_table()
      names(tbl_list[[2]]) <- names(tbl_list[[1]])
      tbl_list[[2]]$id <- id
      tbl_list[[2]]
    })
    
    unique(unlist(sapply(tables_list, '[[', 'id')))
    #> [1] "contrata/Enero/2012" "planta/Enero/2012"
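
With 2 x 9 x 12 = 216 links, a single page that fails to load stops the whole `lapply` call. One possible safeguard, sketched here under the assumption that some year/month combinations may be missing from the site, is to wrap the reader in `tryCatch()` and drop the failures afterwards. `safe_fetch` and `fake_reader` are hypothetical names used only for illustration; in the real loop the reader would be the anonymous function shown above.

```r
# Hypothetical helper: run `reader` on `url`, returning NULL if it errors.
safe_fetch <- function(url, reader) {
  tryCatch(
    reader(url),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for the real scraping function; it fails on one URL on purpose.
fake_reader <- function(url) {
  if (grepl("missing", url)) stop("page not found")
  data.frame(id = url, value = 1)
}

urls <- c("page/ok1", "page/missing", "page/ok2")
results <- lapply(urls, safe_fetch, reader = fake_reader)

# Drop the failed pages before combining the tables.
results <- Filter(Negate(is.null), results)
length(results)
```

Returning `NULL` for a failed page keeps the loop going, and the message records which links were skipped so they can be retried later.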