rvest Error HTTP 500 when submitting form to download csv file


I'm trying to download a parameterized .csv file from a page. I'm quite new to scraping, but it seems to me that I need to fill out a form with the desired values and then submit it to receive a response (the file itself).

So I've been trying to do this with rvest, but have not succeeded so far. I tried three similar approaches, which returned two different error messages.

url <- 'https://www.anbima.com.br/informacoes/est-termo/default.asp'
  
  
  # read_html
  sess <- read_html(url)
  form <- sess %>% 
            html_form() %>%
            .[[1]] %>% 
            html_form_set(escolha = 2, Idioma = "PT", saida = "csv", 
                          Dt_Ref = "13/06/2024")
  resp <- html_form_submit(form)
  read_html(resp)
  # > Warning message: In session_set_response(x, resp) : Internal Server Error (HTTP 500).
      
  # read_html_live
  sess <- read_html_live(url)
  form <- sess %>% 
            html_elements("form") %>% 
            html_form() %>%
            .[[1]] %>% 
            html_form_set(escolha = 2, Idioma = "PT", saida = "csv", 
                          Dt_Ref = "13/06/2024")
  resp <- html_form_submit(form)
  # > Error in curl::curl_fetch_memory(url, handle = handle) : Could not resolve host: CZ.asp  

  # session
  sess <- session(url)
  form <- sess %>% 
            read_html() %>% 
            html_form() %>% 
            .[[1]] %>% 
            html_form_set(escolha = 2, Idioma = "PT", saida = "csv", 
                          Dt_Ref = "13/06/2024")
  resp <- session_submit(sess, form)
  # > Warning message: In session_set_response(x, resp) : Internal Server Error (HTTP 500).

Additionally, it appears that submitting the form on the page goes directly to a js function called 'VerificaSubmit()'. Researching the topic led me to this SO post, where a comment said that 'rvest cannot execute javascript', while the penultimate comment hinted that a solution was possible.

My question: is it possible to scrape sites like this one using rvest, or do I need another package?

Thanks!


Solution

  • You certainly can use rvest to generate a request for that csv, but I don't believe it properly supports radio buttons in a form. JavaScript also plays a role here, and rvest itself(!) is not able to execute it.

    That said, the linked post pre-dates rvest::read_html_live() and rvest's integration with chromote (at least when considering CRAN releases). Still, just replacing read_html() with read_html_live() makes virtually no difference in this particular case: the initial page view is static anyway, and JavaScript kicks in only while the user interacts with the form. In other words, for the read_html_live() approach you should drop html_form() + html_form_set() and interact with the returned LiveHTML object through its methods.


    For now, let's just debug your first approach: we'll check the unchanged form object and capture the actual request by wrapping html_form_submit() with httr::with_verbose():

    library(rvest)
    library(httr)
    
    url_ <- 'https://www.anbima.com.br/informacoes/est-termo/default.asp'
    html <- read_html(url_)
    
    form <- 
      html |>
      html_element("form") |>
      html_form() 
    form
    #> <form> 'CURVA_Z' (POST https://www.anbima.com.br/informacoes/est-termo/CZ.asp)
    #>   <field> (radio) escolha: 1
    #>   <field> (radio) escolha: 2
    #>   <field> (radio) Idioma: PT
    #>   <field> (radio) Idioma: US
    #>   <field> (radio) saida: xls
    #>   <field> (radio) saida: csv
    #>   <field> (radio) saida: txt
    #>   <field> (radio) saida: xml
    #>   <field> (hidden) Dt_Ref_Ver: 20240610
    #>   <field> (text) Dt_Ref: 17/06/2024
    
    with_verbose(
      html_form_submit(form)
    )
    #> -> POST /informacoes/est-termo/CZ.asp HTTP/1.1
    #> ... 
    #> >> escolha=1&escolha=2&Idioma=PT&Idioma=US&saida=xls&saida=csv&saida=txt&saida=xml&Dt_Ref_Ver=20240610&Dt_Ref=17%2F06%2F2024
    #> <- HTTP/1.1 500 Internal Server Error
    #> ...
    
    #> Response [https://www.anbima.com.br/informacoes/est-termo/CZ.asp]
    #>   Date: 2024-06-18 09:19
    #>   Status: 500
    #>   Content-Type: text/html
    #>   Size: 1.21 kB
    

    Two things to note here:

    • The form's action is still CZ.asp, but when you check the VerificaSubmit() js function or the requests made by your js-enabled browser, you should see that it must be changed to CZ-down.asp when the user chooses "Download":
    function VerificaSubmit() {
        ...
        if (document.CURVA_Z.escolha[0].checked) {
            vForm.action = "CZ.asp"; 
            document.CURVA_Z.Idioma[0].checked = true;
        } else {
            vForm.action = "CZ-down.asp";
            vForm.target = "framedown";
        }
        vForm.submit();
    }
    
    • html_form_submit() just builds a request from all existing form elements, i.e. you end up with all radio button values ( escolha=1&escolha=2&Idioma=PT&Idioma=US&... ) in your request, which is the reason for the HTTP 500 error. Radio inputs share names; with ... %>% html_form_set(escolha = 2, ...) you only set the value of the first element named escolha, and the other elements with the same name are not removed.
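
The shared-name behaviour can be reproduced offline. The sketch below is only an illustration, assuming a made-up form built with rvest::minimal_html() (the markup and the /demo.asp action are hypothetical, not the real ANBIMA page). It only inspects form$fields, the same list that is manipulated in this answer.

```r
library(rvest)

# Hypothetical stand-in for the real page: two radio inputs sharing one name
page <- minimal_html('
  <form name="demo" method="post" action="/demo.asp">
    <input type="radio" name="escolha" value="1" checked>
    <input type="radio" name="escolha" value="2">
    <input type="text" name="Dt_Ref" value="17/06/2024">
  </form>')

form <- html_form(page)[[1]]

# Both radio inputs are parsed as separate fields with the same name
names(form$fields)

# html_form_set() updates a field by name but does not drop its sibling,
# so two "escolha" fields are still present afterwards
form_set <- html_form_set(form, escolha = 2)
n_escolha <- sum(names(form_set$fields) == "escolha")
n_escolha
```

If n_escolha is still 2 after html_form_set(), both values would be sent on submit, which matches the duplicated parameters in the captured request.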

    What if we update the form action ourselves and, instead of using html_form_set(), keep only the form elements that we do not want to change and rebuild the list of form elements?

    # update action
    form$action <- gsub("CZ.asp", "CZ-down.asp", form$action, fixed = TRUE)
    
    # named list of form fields to set
    form_values <- 
      list(escolha = 2, 
           Idioma = "PT", 
           saida = "csv", 
           Dt_Ref = "13/06/2024")
    
    # keep only the fields we are not overriding (there's 1 hidden input we may need to keep)
    keep_fields <- form$fields[!names(form$fields) %in% names(form_values)]
    # generate a new list of fields, type doesn't make much difference;
    # note that rvest:::rvest_field() is an unexported internal and may change between releases
    fake_fields <- purrr::imap(
      form_values, 
      \(value_, name_) rvest:::rvest_field(type = "text", name = name_, 
                                           value = value_, attr = NA)
    )
    # combine 2 lists, check updated form
    form$fields <- c(keep_fields, fake_fields)
    form
    #> <form> 'CURVA_Z' (POST https://www.anbima.com.br/informacoes/est-termo/CZ-down.asp)
    #>   <field> (hidden) Dt_Ref_Ver: 20240610
    #>   <field> (text) escolha: 2
    #>   <field> (text) Idioma: PT
    #>   <field> (text) saida: csv
    #>   <field> (text) Dt_Ref: 13/06/2024
    
    with_verbose(
      resp <- html_form_submit(form)
    )
    #> -> POST /informacoes/est-termo/CZ-down.asp HTTP/1.1
    #> ...
    #> >> Dt_Ref_Ver=20240610&escolha=2&Idioma=PT&saida=csv&Dt_Ref=13%2F06%2F2024
    #> <- HTTP/1.1 200 OK
    #> ...
    
    # while we do get a valid response ...
    resp
    #> Response [https://www.anbima.com.br/informacoes/est-termo/CZ-down.asp]
    #>   Date: 2024-06-18 10:00
    #>   Status: 200
    #>   Content-Type: text/csv
    #>   Size: 3.1 kB
    #> NA
    
    # ... we need to set encoding ourselves when accessing response content
    content(resp, as = "text", encoding = "iso-8859-1") |> 
      readr::read_lines() |>
      stringr::str_view()
    #>  [1] │ 13/06/2024;Beta 1;Beta 2;Beta 3;Beta 4;Lambda 1;Lambda 2
    #>  [2] │ PREFIXADOS;0,116173647818776;-1,55195864358588E-02;-0,022464285391532;5,15094292122757E-02;0,678739835169317;0,417088286334441
    #>  [3] │ IPCA;7,00351571802289E-02;4,71475563073708E-02;-7,55570010258052E-02;-2,09557600141436E-02;3,23354363268509;9,50245531402194E-02
    #>  [4] │ 
    #>  [5] │ ETTJ Inflação Implicita (IPCA)
    #>  [6] │ Vertices;ETTJ IPCA;ETTJ PREF;Inflação Implícita
    #>  [7] │ 126;7,0470;10,4642;3,1922
    #>  [8] │ 252;6,3639;10,8168;4,1864
    #>  [9] │ 378;6,3456;11,1196;4,4891
    #> [10] │ 504;6,4010;11,3741;4,6739
    #> [11] │ 630;6,4417;11,5837;4,8308
    #> [12] │ 756;6,4635;11,7533;4,9686
    #> [13] │ 882;6,4725;11,8883;5,0865
    #> [14] │ 1.008;6,4735;11,9936;5,1844
    #> [15] │ 1.134;6,4696;12,0742;5,2640
    #> [16] │ 1.260;6,4627;12,1341;5,3271
    #> [17] │ 1.386;6,4542;12,1773;5,3761
    #> [18] │ 1.512;6,4447;12,2068;5,4132
    #> [19] │ 1.638;6,4349;12,2254;5,4404
    #> [20] │ 1.764;6,4249;12,2353;5,4596
    #> ... and 107 more
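
Why the explicit encoding matters can be shown without a network call. The response above reports Content-Type: text/csv with no charset, so the bytes have to be decoded as ISO-8859-1 by hand. The sketch below uses a hand-written byte vector (an assumption standing in for a fragment of the response body) and base R's iconv().

```r
# "Inflação" as ISO-8859-1 bytes: ç is 0xE7 and ã is 0xE3 (one byte each)
latin1_bytes <- as.raw(c(0x49, 0x6e, 0x66, 0x6c, 0x61, 0xe7, 0xe3, 0x6f))

# Taken as-is, the bytes do not form valid UTF-8
as_is <- rawToChar(latin1_bytes)
validUTF8(as_is)

# Declaring the source encoding converts them to a proper UTF-8 string
decoded <- iconv(as_is, from = "ISO-8859-1", to = "UTF-8")
validUTF8(decoded)
```

Passing encoding = "iso-8859-1" to content() serves the same purpose for the whole response body.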
    

    As we gathered all required details for crafting that POST request ourselves, we can just move away from rvest and use httr or httr2 directly:

    library(httr2)
    library(readr)
    library(stringr)
    
    tmp_out <- tempfile()
    
    # make a POST request, save result to tempdir
    resp <- 
      request("https://www.anbima.com.br/informacoes/est-termo/CZ-down.asp") |>
      # hidden Dt_Ref_Ver seems to be optional 
      req_body_form(escolha =  2,
                    Idioma  = "PT",
                    saida   = "csv",
                    Dt_Ref  = "13/06/2024") |>
      req_perform(path = tmp_out) 
    resp
    #> <httr2_response>
    #> POST https://www.anbima.com.br/informacoes/est-termo/CZ-down.asp
    #> Status: 200 OK
    #> Content-Type: text/csv
    #> Body: On disk
    #> 'C:\Users\marguslt\AppData\Local\Temp\Rtmp2FGlY2\file37402916472b' (3095 bytes)
    
    # check for status
    if (!resp_is_error(resp)){
      # guess encoding:
      resp_body_raw(resp) |> 
        stringi::stri_enc_detect()
    #> [[1]]
    #>     Encoding Language Confidence
    #> 1 ISO-8859-1       pt       0.33
    #> 2 ISO-8859-9       tr       0.19
    #> 3 ISO-8859-2       hu       0.11
    #> 4   UTF-16BE                0.10
    #> 5   UTF-16LE                0.10
    #> 6  Shift_JIS       ja       0.10
    #> 7    GB18030       zh       0.10
    #> 8       Big5       zh       0.10
      
      # extract file name from header, create a copy to working directory, 
      # check resulting file
      dest <- stringr::str_split_i(resp$headers$`content-disposition`, "=", 2)
      file.copy(tmp_out, dest)
      fs::file_info(dest)[1:3]
    #> # A tibble: 1 × 3
    #>   path                   type         size
    #>   <fs::path>             <fct> <fs::bytes>
    #> 1 CurvaZero_13062024.csv file        3.02K
      
      # read file, provide encoding
      read_lines(dest, locale = locale(encoding = "ISO-8859-1")) |> 
        str_view()
    #>  [1] │ 13/06/2024;Beta 1;Beta 2;Beta 3;Beta 4;Lambda 1;Lambda 2
    #>  [2] │ PREFIXADOS;0,116173647818776;-1,55195864358588E-02;-0,022464285391532;5,15094292122757E-02;0,678739835169317;0,417088286334441
    #>  [3] │ IPCA;7,00351571802289E-02;4,71475563073708E-02;-7,55570010258052E-02;-2,09557600141436E-02;3,23354363268509;9,50245531402194E-02
    #>  [4] │ 
    #>  [5] │ ETTJ Inflação Implicita (IPCA)
    #>  [6] │ Vertices;ETTJ IPCA;ETTJ PREF;Inflação Implícita
    #>  [7] │ 126;7,0470;10,4642;3,1922
    #>  [8] │ 252;6,3639;10,8168;4,1864
    #>  [9] │ 378;6,3456;11,1196;4,4891
    #> [10] │ 504;6,4010;11,3741;4,6739
    #> [11] │ 630;6,4417;11,5837;4,8308
    #> [12] │ 756;6,4635;11,7533;4,9686
    #> [13] │ 882;6,4725;11,8883;5,0865
    #> [14] │ 1.008;6,4735;11,9936;5,1844
    #> [15] │ 1.134;6,4696;12,0742;5,2640
    #> [16] │ 1.260;6,4627;12,1341;5,3271
    #> [17] │ 1.386;6,4542;12,1773;5,3761
    #> [18] │ 1.512;6,4447;12,2068;5,4132
    #> [19] │ 1.638;6,4349;12,2254;5,4404
    #> [20] │ 1.764;6,4249;12,2353;5,4596
    #> ... and 107 more
    
    }else{
      resp_status_desc(resp)
    }
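
Once the lines are readable, the yield-curve table can be parsed into numbers. The sketch below is only an illustration: it uses four lines copied from the output above instead of the downloaded file, and assumes that readr's locale() with a decimal comma and a dot as grouping mark matches the file's number format. Finding where the table starts in the real file is left out.

```r
library(readr)

# Header and three rows copied from the output above
# (non-ASCII characters written as \u escapes)
sample_lines <- c(
  "Vertices;ETTJ IPCA;ETTJ PREF;Infla\u00e7\u00e3o Impl\u00edcita",
  "126;7,0470;10,4642;3,1922",
  "252;6,3639;10,8168;4,1864",
  "1.008;6,4735;11,9936;5,1844"
)

ettj <- read_delim(
  I(paste(sample_lines, collapse = "\n")),  # I() marks the string as literal data
  delim = ";",
  # decimal comma, dot as thousands separator (so "1.008" means 1008)
  locale = locale(decimal_mark = ",", grouping_mark = "."),
  # col_number() strips the grouping mark; the rates are plain doubles
  col_types = cols(Vertices = col_number(), .default = col_double())
)
ettj
```

For the real file, the same locale() could be combined with encoding = "ISO-8859-1", as in the read_lines() call above.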