I have a similar problem to this one. I want to download the tables for all years/months on this webpage. So far I have been able to download the tables that appear when the website first opens, using the following code:
library(rvest)  # provides read_html(), html_nodes(), html_table()

#######
# Pages
#######
yr.list <- seq(2012, 2020)
mes.list <- c("Enero", "Febrero", "Marzo", "Abril", "Mayo", "Junio", "Julio", "Agosto", "Septiembre", "Octubre", "Noviembre", "Diciembre")
c.list <- c("contrata","planta")
################################################
## UTarapaca Scraping Loop PLANTA & CONTRATA
################################################
combined_df <- data.frame()
for (c in c.list) {
  for (yr in yr.list) {
    for (mes in mes.list) {
      # UTarapaca root URL
      root <- "https://www.uta.cl/transparencia/"
      # Full link
      url <- paste(root, c, "/", yr, "/", mes, "/", sep = "")
      # Parse the HTML file
      file <- read_html(url)
      # Get the nodes where the tables live
      tables <- html_nodes(file, "table")
      # This is the relevant table
      table <- as.data.frame(html_table(tables[1], fill = TRUE))
    }
  }
}
However, that code only fetches the 10 records on the first page (Registros por pagina = 10 in the upper right corner of the table), and what I want is to download all the records that the paginated table contains. I tried looping over the different "table pages" (see the page selector in the lower right corner of the table), but the URL does not change when I change the page.
Any help on this would be greatly appreciated. Best, Maria
Here is a way with rvest. First create all links outside any loop. Then lapply an anonymous function to read each page and extract the tables from those pages.
library(httr)
library(rvest)
library(dplyr)
root <- "https://www.uta.cl/transparencia/"
c.list <- c("contrata","planta")
yr.list <- seq(2012, 2020)
mes.list <- c("Enero", "Febrero", "Marzo", "Abril", "Mayo", "Junio", "Julio", "Agosto", "Septiembre", "Octubre", "Noviembre", "Diciembre")
df_links <- expand.grid(c.list, yr.list, mes.list)
head(df_links)
links <- with(df_links, sprintf("%s%s/%s/%s", root, Var1, Var2, Var3))
length(links)
tables_list <- lapply(links, \(x) {
  page <- read_html(x)
  tbl_list <- page %>%
    html_elements("table") %>%
    html_children() %>%
    html_table()
  # The first child holds the header row, the second the table body
  names(tbl_list[[2]]) <- names(tbl_list[[1]])
  tbl_list[[2]]
})
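If some of the category/year/month combinations turn out not to exist on the site (an assumption, I have not checked every link), a single failing read_html call would stop the whole lapply. A minimal sketch of a wrapper that skips such pages follows; safe_read_html is a hypothetical helper name, not part of rvest.

```r
library(rvest)

# Hypothetical helper: returns the parsed page, or NULL if reading fails
safe_read_html <- function(x) {
  tryCatch(
    read_html(x),
    error = function(e) {
      # Report which link failed and carry on with the next one
      message("Skipping ", x, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Usage sketch with the links vector built above (not run here):
# pages <- lapply(links, safe_read_html)
# pages <- Filter(Negate(is.null), pages)  # drop the failed reads
```

The table extraction can then be applied only to the pages that were actually read.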
To create a column with the combination c/mes/year, use the following lapply loop.
tables_list <- lapply(seq_along(links), \(i) {
  x <- links[i]
  # Identifier in the form c/mes/year, e.g. "contrata/Enero/2012"
  id <- with(df_links, sprintf("%s/%s/%s", Var1[i], Var3[i], Var2[i]))
  page <- read_html(x)
  tbl_list <- page %>%
    html_elements("table") %>%
    html_children() %>%
    html_table()
  names(tbl_list[[2]]) <- names(tbl_list[[1]])
  tbl_list[[2]]$id <- id
  tbl_list[[2]]
})
unique(unlist(sapply(tables_list, '[[', 'id')))
#> [1] "contrata/Enero/2012" "planta/Enero/2012"
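To end up with one data frame like the combined_df in the question, the list elements can be stacked with dplyr::bind_rows. The sketch below uses two toy tables in place of the scraped ones; the column names Nombre and Grado and their values are made up for illustration. It first converts every column to character, on the assumption that html_table may guess different column types on different pages, which would otherwise make bind_rows fail.

```r
library(dplyr)

# Toy stand-ins for the scraped tables (hypothetical columns and values)
tables_list <- list(
  data.frame(Nombre = "A", Grado = 5L,    id = "contrata/Enero/2012"),
  data.frame(Nombre = "B", Grado = "s/g", id = "planta/Enero/2012")
)

# Coerce every column to character so that differing type guesses
# between pages do not stop bind_rows()
tables_chr <- lapply(tables_list, function(d) {
  mutate(d, across(everything(), as.character))
})

# Stack all tables; columns missing from a table are filled with NA
combined_df <- bind_rows(tables_chr)
```

Columns can be converted back to numeric afterwards where that makes sense.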