I'm trying to download a parameterized .csv file from a page. I'm quite new to scraping, but it seems to me that I need to fill out a form with the desired values and then submit it to receive a response (the file itself).
So I've been trying to make it happen using rvest, but have not succeeded so far. I tried three similar approaches, which returned two different error messages.
library(rvest)

url <- 'https://www.anbima.com.br/informacoes/est-termo/default.asp'

# read_html
sess <- read_html(url)
form <- sess %>%
  html_form() %>%
  .[[1]] %>%
  html_form_set(escolha = 2, Idioma = "PT", saida = "csv",
                Dt_Ref = "13/06/2024")
resp <- html_form_submit(form)
read_html(resp)
# > Warning message: In session_set_response(x, resp) : Internal Server Error (HTTP 500).

# read_html_live
sess <- read_html_live(url)
form <- sess %>%
  html_elements("form") %>%
  html_form() %>%
  .[[1]] %>%
  html_form_set(escolha = 2, Idioma = "PT", saida = "csv",
                Dt_Ref = "13/06/2024")
resp <- html_form_submit(form)
# > Error in curl::curl_fetch_memory(url, handle = handle) : Could not resolve host: CZ.asp

# session
sess <- session(url)
form <- sess %>%
  read_html() %>%
  html_form() %>%
  .[[1]] %>%
  html_form_set(escolha = 2, Idioma = "PT", saida = "csv",
                Dt_Ref = "13/06/2024")
resp <- session_submit(sess, form)
# > Warning message: In session_set_response(x, resp) : Internal Server Error (HTTP 500).
Additionally, it appears that the form, when submitted on the page, goes directly to a JS function called 'VerificaSubmit()'. Researching the topic led me to this SO post, where a comment said that 'rvest cannot execute javascript', while the penultimate comment hinted that a solution was possible.
My question: is it possible to scrape sites like this one using rvest, or do I need another package?
Thanks!
You certainly can use rvest to generate a request for that csv, but I don't believe it properly supports radio buttons in a form; plus, JavaScript does play a role here, and rvest itself(!) is not able to execute it.
That said, the linked post pre-dates rvest::read_html_live() and rvest's integration with chromote (at least when considering CRAN releases), though just replacing read_html() with read_html_live() makes virtually no difference in this particular case: the initial page view is static anyway, and JavaScript kicks in only while the user interacts with the form. In other words, for the read_html_live() approach you should drop html_form() + html_form_set() and interact with the returned LiveHTML object through its methods.
For now let's just debug your first approach: we'll check the unchanged form object and capture the actual request by wrapping html_form_submit() with httr::with_verbose():
library(rvest)
library(httr)
url_ <- 'https://www.anbima.com.br/informacoes/est-termo/default.asp'
html <- read_html(url_)
form <-
  html |>
  html_element("form") |>
  html_form()
form
#> <form> 'CURVA_Z' (POST https://www.anbima.com.br/informacoes/est-termo/CZ.asp)
#> <field> (radio) escolha: 1
#> <field> (radio) escolha: 2
#> <field> (radio) Idioma: PT
#> <field> (radio) Idioma: US
#> <field> (radio) saida: xls
#> <field> (radio) saida: csv
#> <field> (radio) saida: txt
#> <field> (radio) saida: xml
#> <field> (hidden) Dt_Ref_Ver: 20240610
#> <field> (text) Dt_Ref: 17/06/2024
with_verbose(
html_form_submit(form)
)
#> -> POST /informacoes/est-termo/CZ.asp HTTP/1.1
#> ...
#> >> escolha=1&escolha=2&Idioma=PT&Idioma=US&saida=xls&saida=csv&saida=txt&saida=xml&Dt_Ref_Ver=20240610&Dt_Ref=17%2F06%2F2024
#> <- HTTP/1.1 500 Internal Server Error
#> ...
#> Response [https://www.anbima.com.br/informacoes/est-termo/CZ.asp]
#> Date: 2024-06-18 09:19
#> Status: 500
#> Content-Type: text/html
#> Size: 1.21 kB
Two things to note here.
First, the form action is still CZ.asp, but when you check the VerificaSubmit() JS function or the requests made by your JS-enabled browser, you should see that it must be changed to CZ-down.asp when the user chooses "Download":
function VerificaSubmit() {
...
if (document.CURVA_Z.escolha[0].checked) {
vForm.action = "CZ.asp";
document.CURVA_Z.Idioma[0].checked = true;
} else {
vForm.action = "CZ-down.asp";
vForm.target = "framedown";
}
vForm.submit();
}
Second, html_form_submit() just builds a request from all existing form elements, i.e. you end up with all radio button values (escolha=1&escolha=2&Idioma=PT&Idioma=US&...) in your request, which is the reason for the HTTP 500 error. Radio inputs share names; with ... %>% html_form_set(escolha = 2, ...) you only set the value of the first element named escolha, and the other elements with the same name are not removed.
What if we update the form action ourselves and, instead of using html_form_set(), keep only the form elements that we do not want to change and rebuild the form's field list?
# update action
form$action <- gsub("CZ.asp", "CZ-down.asp", form$action, fixed = TRUE)
# named list of form fields to set
form_values <-
  list(escolha = 2,
       Idioma = "PT",
       saida = "csv",
       Dt_Ref = "13/06/2024")
# drop existing fields that we are going to set ourselves
# (this keeps the 1 hidden input we may need)
keep_fields <- form$fields[!names(form$fields) %in% names(form_values)]
# generate a new list of fields, type doesn't make much difference
fake_fields <- purrr::imap(form_values,
                           \(value_, name_) rvest:::rvest_field(type = "text", name = name_,
                                                                value = value_, attr = NA))
# combine 2 lists, check updated form
form$fields <- c(keep_fields, fake_fields)
form
#> <form> 'CURVA_Z' (POST https://www.anbima.com.br/informacoes/est-termo/CZ-down.asp)
#> <field> (hidden) Dt_Ref_Ver: 20240610
#> <field> (text) escolha: 2
#> <field> (text) Idioma: PT
#> <field> (text) saida: csv
#> <field> (text) Dt_Ref: 13/06/2024
with_verbose(
resp <- html_form_submit(form)
)
#> -> POST /informacoes/est-termo/CZ-down.asp HTTP/1.1
#> ...
#> >> Dt_Ref_Ver=20240610&escolha=2&Idioma=PT&saida=csv&Dt_Ref=13%2F06%2F2024
#> <- HTTP/1.1 200 OK
#> ...
# while we do get a valid response ...
resp
#> Response [https://www.anbima.com.br/informacoes/est-termo/CZ-down.asp]
#> Date: 2024-06-18 10:00
#> Status: 200
#> Content-Type: text/csv
#> Size: 3.1 kB
#> NA
# ... we need to set encoding ourselves when accessing response content
content(resp, as = "text", encoding = "iso-8859-1") |>
readr::read_lines() |>
stringr::str_view()
#> [1] │ 13/06/2024;Beta 1;Beta 2;Beta 3;Beta 4;Lambda 1;Lambda 2
#> [2] │ PREFIXADOS;0,116173647818776;-1,55195864358588E-02;-0,022464285391532;5,15094292122757E-02;0,678739835169317;0,417088286334441
#> [3] │ IPCA;7,00351571802289E-02;4,71475563073708E-02;-7,55570010258052E-02;-2,09557600141436E-02;3,23354363268509;9,50245531402194E-02
#> [4] │
#> [5] │ ETTJ Inflação Implicita (IPCA)
#> [6] │ Vertices;ETTJ IPCA;ETTJ PREF;Inflação Implícita
#> [7] │ 126;7,0470;10,4642;3,1922
#> [8] │ 252;6,3639;10,8168;4,1864
#> [9] │ 378;6,3456;11,1196;4,4891
#> [10] │ 504;6,4010;11,3741;4,6739
#> [11] │ 630;6,4417;11,5837;4,8308
#> [12] │ 756;6,4635;11,7533;4,9686
#> [13] │ 882;6,4725;11,8883;5,0865
#> [14] │ 1.008;6,4735;11,9936;5,1844
#> [15] │ 1.134;6,4696;12,0742;5,2640
#> [16] │ 1.260;6,4627;12,1341;5,3271
#> [17] │ 1.386;6,4542;12,1773;5,3761
#> [18] │ 1.512;6,4447;12,2068;5,4132
#> [19] │ 1.638;6,4349;12,2254;5,4404
#> [20] │ 1.764;6,4249;12,2353;5,4596
#> ... and 107 more
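The radio-button behaviour described above can be reproduced offline, without touching the ANBIMA site. This is only a minimal sketch on a made-up form: the field names choice and date and the example.com base URL are hypothetical, chosen purely for illustration.

```r
library(rvest)

# hypothetical form with two radio inputs that share one name
page <- minimal_html('
  <form action="/submit" method="post">
    <input type="radio" name="choice" value="1" checked>
    <input type="radio" name="choice" value="2">
    <input type="text" name="date" value="01/01/2024">
  </form>')

form <- html_form(page, base_url = "https://example.com/")[[1]]

# each radio input is listed as a separate field, so the name repeats
names(form$fields)

# html_form_set() looks a field up by name and updates that single match;
# the other field with the same name stays in the form
form_set <- html_form_set(form, choice = "2")
n_choice <- sum(names(form_set$fields) == "choice")
n_choice
```

If n_choice is still 2 after html_form_set(), both choice fields would be sent on submit, which mirrors the duplicated escolha, Idioma and saida values in the failing request.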
As we have gathered all the details required for crafting that POST request ourselves, we can just move away from rvest and use httr or httr2 directly:
library(httr2)
library(readr)
library(stringr)
tmp_out <- tempfile()
# make a POST request, save the result to a temporary file
resp <-
  request("https://www.anbima.com.br/informacoes/est-termo/CZ-down.asp") |>
  # hidden Dt_Ref_Ver seems to be optional
  req_body_form(escolha = 2,
                Idioma = "PT",
                saida = "csv",
                Dt_Ref = "13/06/2024") |>
  req_perform(path = tmp_out)
resp
#> <httr2_response>
#> POST https://www.anbima.com.br/informacoes/est-termo/CZ-down.asp
#> Status: 200 OK
#> Content-Type: text/csv
#> Body: On disk
#> 'C:\Users\marguslt\AppData\Local\Temp\Rtmp2FGlY2\file37402916472b' (3095 bytes)
# check for status
if (!resp_is_error(resp)){
# guess encoding:
resp_body_raw(resp) |>
stringi::stri_enc_detect()
#> [[1]]
#> Encoding Language Confidence
#> 1 ISO-8859-1 pt 0.33
#> 2 ISO-8859-9 tr 0.19
#> 3 ISO-8859-2 hu 0.11
#> 4 UTF-16BE 0.10
#> 5 UTF-16LE 0.10
#> 6 Shift_JIS ja 0.10
#> 7 GB18030 zh 0.10
#> 8 Big5 zh 0.10
# extract file name from header, create a copy to working directory,
# check resulting file
dest <- stringr::str_split_i(resp$headers$`content-disposition`, "=", 2)
file.copy(tmp_out, dest)
fs::file_info(dest)[1:3]
#> # A tibble: 1 × 3
#> path type size
#> <fs::path> <fct> <fs::bytes>
#> 1 CurvaZero_13062024.csv file 3.02K
# read file, provide encoding
read_lines(dest, locale = locale(encoding = "ISO-8859-1")) |>
str_view()
#> [1] │ 13/06/2024;Beta 1;Beta 2;Beta 3;Beta 4;Lambda 1;Lambda 2
#> [2] │ PREFIXADOS;0,116173647818776;-1,55195864358588E-02;-0,022464285391532;5,15094292122757E-02;0,678739835169317;0,417088286334441
#> [3] │ IPCA;7,00351571802289E-02;4,71475563073708E-02;-7,55570010258052E-02;-2,09557600141436E-02;3,23354363268509;9,50245531402194E-02
#> [4] │
#> [5] │ ETTJ Inflação Implicita (IPCA)
#> [6] │ Vertices;ETTJ IPCA;ETTJ PREF;Inflação Implícita
#> [7] │ 126;7,0470;10,4642;3,1922
#> [8] │ 252;6,3639;10,8168;4,1864
#> [9] │ 378;6,3456;11,1196;4,4891
#> [10] │ 504;6,4010;11,3741;4,6739
#> [11] │ 630;6,4417;11,5837;4,8308
#> [12] │ 756;6,4635;11,7533;4,9686
#> [13] │ 882;6,4725;11,8883;5,0865
#> [14] │ 1.008;6,4735;11,9936;5,1844
#> [15] │ 1.134;6,4696;12,0742;5,2640
#> [16] │ 1.260;6,4627;12,1341;5,3271
#> [17] │ 1.386;6,4542;12,1773;5,3761
#> [18] │ 1.512;6,4447;12,2068;5,4132
#> [19] │ 1.638;6,4349;12,2254;5,4404
#> [20] │ 1.764;6,4249;12,2353;5,4596
#> ... and 107 more
}else{
resp_status_desc(resp)
}
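To go from the raw lines to a data frame, the decimal comma and the "." thousands separator (as in 1.008) have to be declared in the locale. The following is a minimal sketch, not a full parser for the file: csv_lines holds a header and three rows copied from the output above, and the variable names are illustrative.

```r
library(readr)

# header and a few rows copied from the output above; the real file has more
csv_lines <- c(
  "Vertices;ETTJ IPCA;ETTJ PREF;Inflação Implícita",
  "126;7,0470;10,4642;3,1922",
  "882;6,4725;11,8883;5,0865",
  "1.008;6,4735;11,9936;5,1844"
)

ettj <- read_delim(
  I(paste(csv_lines, collapse = "\n")),   # I() marks the string as literal data
  delim = ";",
  locale = locale(decimal_mark = ",", grouping_mark = "."),
  # col_number() drops the grouping mark, so "1.008" is read as 1008
  col_types = cols(Vertices = col_number(), .default = col_double())
)

ettj$Vertices
```

For the downloaded file itself, you would presumably also need encoding = "ISO-8859-1" in locale() and a skip value to jump over the parameter block at the top; since the rest of the file is not shown here, treat that part as an assumption to check against the actual content.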