Tags: r, screen-scraping, rvest, httr

Scraping a web page in R without using RSelenium


I'm trying to scrape the table at the following URL:

https://www.bcb.gov.br/controleinflacao/historicometas

[Screenshot of the page]

What I noticed is that when I use rvest::read_html or httr::GET, or when I view the page source, I can't see anything related to the table. In the Elements tab of the Google Chrome developer tools, however, I can spot the table.

Below is a simple example where I try to access the content of the URL and search for table nodes:

library( tidyverse )
library( rvest )

url <- "https://www.bcb.gov.br/controleinflacao/historicometas"

# html_nodes() returns every match, so an empty node set means no <table> was found
res <- url %>%
    read_html( ) %>%
    html_nodes( "table" )

This gives me:

{xml_nodeset (0)}

Opening the source code of the URL above, we can see:

view-source:https://www.bcb.gov.br/controleinflacao/historicometas

[Screenshot: page source code]

[Screenshot: the table in the developer tools]

From what I have found, the issue is that the scripts in the page source load the table. I have seen some solutions that use RSelenium, but I would like to know whether there is a way to scrape this table without RSelenium.

Some other related StackOverflow questions:

Scraping webpage (with R) where all elements are placed inside an <app-root> tag

scraping table from a website result as empty

(The last one is a Python example.)


Solution

  • When dealing with dynamic sites, the Network tab of the developer tools tends to be more useful than the Inspector. Often you don't have to scroll through hundreds of requests or pages of minified JavaScript. Instead, pick a search term from the rendered page and use it to identify the API endpoint that delivered that piece of information. In this case, searching for "Resolução CMN nº 2.615" pointed to the correct call: most of the page content is delivered as plain HTML inside a JSON response.

    library(tibble)
    library(rvest)
    
    historicometas <- jsonlite::read_json("https://www.bcb.gov.br/api/paginasite/sitebcb/controleinflacao/historicometas")
    historicometas$conteudo %>% 
      read_html() %>% 
      html_element("table") %>% 
      html_table()
    #> # A tibble: 27 × 7
    #>    Ano   Norma                             Data  Meta …¹ Taman…² Inter…³ Infla…⁴
    #>    <chr> <chr>                             <chr> <chr>   <chr>   <chr>   <chr>  
    #>  1 1999  Resolução CMN nº 2.615 ​           30/6… 8       2       6-10    8,94   
    #>  2 2000  Resolução CMN nº 2.615 ​           30/6… ​6       ​2       4-8     5,97   
    #>  3 2001  Resolução CMN nº 2.615 ​           30/6… ​4       ​2       2-6     7,67   
    #>  4 2002  Resolução CMN nº 2.744            28/6… 3,5     2       1,5-5,5 12,53  
    #>  5 2003* Resolução CMN nº 2.842Resolução … 28/6… 3,254   22,5    1,25-5… 9,309,…
    #>  6 2004* Resolução CMN nº 2.972Resolução … 27/6… 3,755,5 2,52,5  1,25-6… 7,60   
    #>  7 2005  Resolução CMN nº 3.108            25/6… 4,5     2,5     2-7     5,69   
    #>  8 2006  Resolução CMN nº 3.210            30/6… 4,5     ​2,0     2,5-6,5 3,14   
    #>  9 2007  Resolução CMN nº 3.291            23/6… 4,5     ​2,0     2,5-6,5 4,46   
    #> 10 2008  Resolução CMN nº 3.378            29/6… 4,5     ​2,0     2,5-6,5 5,90   
    #> # … with 17 more rows, and abbreviated variable names ¹​`Meta (%)`,
    #> #   ²​`Tamanhodo intervalo +/- (p.p.)`, ³​`Intervalode tolerância (%)`,
    #> #   ⁴​`Inflação efetiva(Variação do IPCA, %)`
    

    Created on 2022-10-17 with reprex v2.0.2
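
If you want to see the mechanics without hitting the network, here is a minimal offline sketch of the same idea. The `payload` string is a made-up stand-in for the API response (only the `conteudo` field name comes from the real one; the `titulo` field and the table values are invented for illustration). The point is that the HTML arrives as an ordinary string inside the JSON, so it needs a second parsing step before selectors work on it.

```r
library(jsonlite)
library(rvest)

# Hypothetical stand-in for the API response: a JSON object whose
# `conteudo` field holds an HTML fragment. The values are made up.
payload <- '{"titulo": "Exemplo", "conteudo": "<p>Intro</p><table><tr><th>Ano</th><th>Meta</th></tr><tr><td>1999</td><td>8</td></tr><tr><td>2000</td><td>6</td></tr></table>"}'

# Step 1: parse the JSON. The HTML is still just a character string here.
parsed <- fromJSON(payload)

# Step 2: parse that string as HTML, then select and convert the table.
tbl <- parsed$conteudo %>%
  read_html() %>%
  html_element("table") %>%
  html_table()

print(tbl)
```

With the real endpoint, only step 1 changes (reading from the URL instead of a string); the rest of the pipeline stays the same as in the answer above.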