r web-scraping rvest

R: Error in `html_form_submit()`: `form` doesn't contain a `action` attribute


I'm trying to automate downloading of the data contained here: https://www.offenerhaushalt.at/gemeinde/innsbruck/download

I can specify the form parameters fairly easily, either through the URL query string, like this: https://www.offenerhaushalt.at/gemeinde/innsbruck/download?year=2022&haushalt=fhh&rechnungsabschluss=va&origin=gemeinde

Or through the rvest function `html_form()`. However, I cannot submit the form, because `html_form_submit()` throws this error:

    Error in `submission_build()`:
    ! `form` doesn't contain a `action` attribute

The code that produces it:

    library(rvest)
    library(tidyverse)
    html_form(read_html("https://www.offenerhaushalt.at/gemeinde/innsbruck/download"))[[1]] %>%
        html_form_set(year = "2022",
                      haushalt = "fhh",
                      rechnungsabschluss = "va",
                      origin = "gemeinde") %>%
        html_form_submit()

Any ideas on how to capture the file that is generated afterwards and download it?

It seems to me that the form is actually submitted to a URL that looks like this: https://www.offenerhaushalt.at/downloads/ghdByParams

But I'm not sure what to do with that.

Thanks all!


Solution

  • The parsed form has no `action`, so you can set the method and action URL for that form manually:

    library(rvest)
    library(purrr)
    dl_url <- "https://www.offenerhaushalt.at/gemeinde/innsbruck/download"
    
    sess <- session(dl_url)
    form <- sess %>% read_html() %>% html_form() %>% .[[1]]
    
    # list valid options for select boxes
    map(form$fields, "options") %>% keep(~ length(.x) > 0) %>% 
      imap_dfr(~ list(field = .y, options = paste(.x, collapse = " ")))
    #> # A tibble: 4 × 2
    #>   field              options                                                    
    #>   <chr>              <chr>                                                      
    #> 1 haushalt           default fhh ehh vhh                                        
    #> 2 rechnungsabschluss default ra va                                              
    #> 3 year               default 2022 2021 2020 2019 2018 2017 2016 2015 2014 2013 …
    #> 4 origin             default statistik_at gemeinde
    
    # set values
    form$fields$haushalt$value <- "fhh"
    form$fields$rechnungsabschluss$value <- "ra"
    form$fields$year$value <- "2020"
    form$fields$origin$value <- "statistik_at"
    
    # manually set form method & action
    form$method <- "POST"
    form$action <- "https://www.offenerhaushalt.at/downloads/ghdByParams"
    
    # submit form
    sess <- session_submit(sess, form)
    
    # response headers
    imap_dfr(sess$response$headers, ~ list(header = .y, value = .x))
    #> # A tibble: 10 × 2
    #>    header              value                                                    
    #>    <chr>               <chr>                                                    
    #>  1 date                Sat, 21 Jan 2023 01:47:13 GMT                            
    #>  2 server              Apache                                                   
    #>  3 content-disposition attachment; filename=offenerhaushalt_70101_2020_ra_fhh.c…
    #>  4 pragma              no-cache                                                 
    #>  5 cache-control       must-revalidate, post-check=0, pre-check=0, private      
    #>  6 expires             0                                                        
    #>  7 set-cookie          XSRF-TOKEN=eyJpdiI6IjdHd2pSakwzV09xb3Jab05zXC81em1RPT0iL…
    #>  8 set-cookie          offener_haushalt_session=eyJpdiI6IjI5cUN5MGhCSmVadmN5enV…
    #>  9 transfer-encoding   chunked                                                  
    #> 10 content-type        text/csv; charset=UTF-8
    
    # parse attached CSV
    httr::content(sess$response, as = "text") %>% readr::read_csv2()
    #> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
    #> Rows: 1408 Columns: 11
    #> ── Column specification ────────────────────────────────────────────────────────
    #> Delimiter: ";"
    #> chr (8): ansatz_uab, ansatz_ugl, konto_grp, konto_ugl, sonst_ugl, vorhabenco...
    #> dbl (2): mvag, wert
    #> lgl (1): verguetung
    #> 
    #> ℹ Use `spec()` to retrieve the full column specification for this data.
    #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    #> # A tibble: 1,408 × 11
    #>    ansat…¹ ansat…² konto…³ konto…⁴ sonst…⁵ vergu…⁶ vorha…⁷  mvag ansat…⁸ konto…⁹
    #>    <chr>   <chr>   <chr>   <chr>   <chr>   <lgl>   <chr>   <dbl> <chr>   <chr>  
    #>  1 000     000     042     000     000     NA      0000000  3415 Gewähl… Amts-,…
    #>  2 000     000     070     000     000     NA      0000000  3411 Gewähl… Aktivi…
    #>  3 000     000     400     000     000     NA      0000000  3221 Gewähl… Gering…
    #>  4 000     000     413     000     000     NA      0000000  3221 Gewähl… Handel…
    #>  5 000     000     456     000     000     NA      0000000  3221 Gewähl… Schrei…
    #>  6 000     000     457     000     000     NA      0000000  3221 Gewähl… Druckw…
    #>  7 000     000     459     000     000     NA      0000000  3221 Gewähl… Sonsti…
    #>  8 000     000     618     000     000     NA      0000000  3224 Gewähl… Instan…
    #>  9 000     000     621     000     000     NA      0000000  3222 Gewähl… Sonsti…
    #> 10 000     000     631     000     000     NA      0000000  3222 Gewähl… Teleko…
    #> # … with 1,398 more rows, 1 more variable: wert <dbl>, and abbreviated variable
    #> #   names ¹​ansatz_uab, ²​ansatz_ugl, ³​konto_grp, ⁴​konto_ugl, ⁵​sonst_ugl,
    #> #   ⁶​verguetung, ⁷​vorhabencode, ⁸​ansatz_text, ⁹​konto_text
    

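Why the original call fails can be reproduced offline. The sketch below does not touch the real page: it builds a hypothetical stand-in form with `minimal_html()` (the `year` field name comes from the question, while the markup and the `example.org` URL are assumptions) to show that a form without an `action` attribute is rejected while the request is being built, and that assigning `form$action` by hand fills the gap.

```r
library(rvest)

# Hypothetical stand-in for the download page: a <form> without an `action`
# attribute. Only the field name `year` is taken from the question.
page <- minimal_html('
  <form>
    <select name="year">
      <option value="2022">2022</option>
      <option value="2021">2021</option>
    </select>
  </form>')

form <- html_form(page)[[1]]

# Nothing was parsed for the action, so there is no URL to submit to
length(form$action) == 0

# html_form_submit() stops while building the submission, before any request
msg <- tryCatch(
  html_form_submit(form),
  error = function(e) conditionMessage(e)
)
msg

# Fill in method and action by hand, as in the answer above
form$method <- "POST"
form$action <- "https://example.org/downloads"  # placeholder, not the real endpoint
form <- html_form_set(form, year = "2021")
form$fields$year$value
```

The real endpoint still has to be found separately, for example in the browser's network tab, as the asker already did.
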
    Because `session_submit()` passes additional arguments on to httr, the attached file can also be saved directly to disk:

    dest_file <- tempfile(fileext = ".csv")
    session_submit(sess, form, submit = NULL, httr::write_disk(dest_file))
    # browseURL(dirname(dest_file))
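
If several years or budget types are needed, the steps above could be wrapped in a small function. This is only a sketch: `download_budget()` and its arguments are hypothetical names, not part of rvest, and it assumes the endpoint keeps behaving as shown in the answer (same field names, same action URL, no extra token handling required).

```r
library(rvest)

# Hypothetical helper wrapping the steps from the answer above.
# Defaults mirror the values used there.
download_budget <- function(year,
                            haushalt = "fhh",
                            rechnungsabschluss = "ra",
                            origin = "statistik_at",
                            dest_file = tempfile(fileext = ".csv")) {
  dl_url <- "https://www.offenerhaushalt.at/gemeinde/innsbruck/download"

  # start a session and parse the first form on the page
  sess <- session(dl_url)
  form <- html_form(read_html(sess))[[1]]

  # set values
  form$fields$haushalt$value <- haushalt
  form$fields$rechnungsabschluss$value <- rechnungsabschluss
  form$fields$year$value <- as.character(year)
  form$fields$origin$value <- origin

  # manually set form method & action
  form$method <- "POST"
  form$action <- "https://www.offenerhaushalt.at/downloads/ghdByParams"

  # submit and write the attached CSV straight to disk
  session_submit(sess, form, submit = NULL, httr::write_disk(dest_file))
  dest_file
}

# Usage (needs network access, so left commented out):
# files <- vapply(2019:2021, download_budget, character(1))
# budgets <- lapply(files, readr::read_csv2)
```

Each call opens a fresh session, which keeps the function self-contained at the cost of one extra page request per download.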