Downloading a dynamic file from html node with R


So, I have the following script:

library(rvest)
library(xml2)

DOES <- session("https://ioes.dio.es.gov.br/portal/visualizacoes/diario_oficial")
DOES <- read_html(DOES)
x1b6 <- xml_find_all(DOES, "//a[@id='baixar-diario-completo']")
x1b6
#> {xml_nodeset (1)}
#> [1] <a href="/portal/edicoes/download/0" id="baixar-diario-completo">\n                        <img src=""  ...

It's the official journal of my local government. I'm trying to download the file linked at this XPath: html//body//div[2]//div[1]//div[1]//div[1]//div[1]//div[1]//a

The file changes every day with each new edition of the journal, so I'm trying to build an extraction routine that downloads it automatically every day. When I inspect the element in Chrome, it shows the correct daily href: https://ioes.dio.es.gov.br/portal/edicoes/download/7620. In the output of the code above, however, the href ends with 0. How can I get the right path?


Solution

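  • I propose this solution, which reads the edition id from a JSON endpoint on the same site instead of from the HTML. The href most likely ends in 0 because the page fills in the real id with JavaScript after loading, and rvest does not run JavaScript. Supply the function with a date and that day's PDF is downloaded to your working directory.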

    library(tidyverse)
    library(httr2)
    
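    # Download the journal edition for `date` and save it as "<date>.pdf"
    get_file <- function(date) {
      # Ask the JSON endpoint which edition was published on that date
      str_c("https://ioes.dio.es.gov.br/apifront/portal/edicoes/edicoes_from_data/", date,
            ".json?&subtheme=false") %>%
        request() %>%
        req_perform() %>%
        resp_body_json(simplifyVector = TRUE) %>%
        # Extract the edition id from the `itens` table
        getElement("itens") %>%
        pull(id) %>%
        # Build the download URL and save the PDF in binary mode
        str_c("https://ioes.dio.es.gov.br/portal/edicoes/download/", .) %>%
        download.file(., mode = "wb",
                      destfile = str_c(date, ".pdf"))
    }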
    
    get_file("2022-11-30")
    get_file(lubridate::today())
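
If a date can return more than one edition, or none at all (for example on a day without publication), the function above would probably fail, because `download.file()` would get several URLs or an empty one for a single `destfile`. I have not checked whether the site actually does this, so treat the following as a rough sketch under that assumption. The helper names (`api_url`, `download_urls`, `dest_files`, `get_files`) and the `_1`, `_2` file suffixes are made up for illustration; the two base URLs and the `itens`/`id` fields come from the answer above.

```r
# Base URLs taken from the answer above.
api_base      <- "https://ioes.dio.es.gov.br/apifront/portal/edicoes/edicoes_from_data/"
download_base <- "https://ioes.dio.es.gov.br/portal/edicoes/download/"

# Build the JSON endpoint URL for one date (a Date or a "YYYY-MM-DD" string).
api_url <- function(date) {
  paste0(api_base, format(as.Date(date), "%Y-%m-%d"), ".json?&subtheme=false")
}

# Build one download URL per edition id.
download_urls <- function(ids) {
  paste0(download_base, ids)
}

# Build one destination file name per edition. A numeric suffix is added
# only when a date has several editions (an assumption, not verified).
dest_files <- function(date, n) {
  stem <- format(as.Date(date), "%Y-%m-%d")
  if (n == 1) paste0(stem, ".pdf") else paste0(stem, "_", seq_len(n), ".pdf")
}

# Hypothetical wrapper: download every edition listed for `date` into `dir`.
get_files <- function(date, dir = ".") {
  resp  <- httr2::req_perform(httr2::request(api_url(date)))
  itens <- httr2::resp_body_json(resp, simplifyVector = TRUE)$itens
  ids   <- itens$id
  if (length(ids) == 0) {
    message("No edition found for ", format(as.Date(date), "%Y-%m-%d"))
    return(invisible(character()))
  }
  urls  <- download_urls(ids)
  files <- file.path(dir, dest_files(date, length(ids)))
  for (i in seq_along(urls)) {
    download.file(urls[i], destfile = files[i], mode = "wb")
  }
  invisible(files)
}
```

Calling `get_files(Sys.Date())` from a scheduled job (cron, Windows Task Scheduler) would then cover the daily routine. The URL builders are kept separate from the network call so they can be checked without hitting the site.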