Tags: r, phantomjs, v8, rvest

How to scrape this website in R using rvest?


I’m trying to scrape this website using rvest: https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx

Notice that the page loads quickly, but the data takes some time to appear. While the content shows up as HTML in the browser's inspector, the same nodes come back empty when scraped with rvest.

library(dplyr)
library(rvest)

camara <- "https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx" %>% 
  session()


camara %>% 
  html_elements("h2")

camara %>% 
  html_elements(".box-proyecto") 

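camara %>% 
  html_elements("#trabajo-en-sala") %>% 
  html_elements("#info-tabs") %>% 
  html_elements("#ajax-container") %>% 
  # Assumption: pnlTablaOrdinaria is the end of an element id; a bare name would be read as a tag selector
  html_elements("[id$='pnlTablaOrdinaria']")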

All of these should return at least some text content, but they appear empty.

I tried using V8 to interpret the JavaScript according to these instructions, but the site appears to use JS only for interface elements, not for data retrieval.

I also tried to run it through PhantomJS following these instructions, but couldn’t run the script due to permission issues.

It seems that I need to perform a GET request for the data, but the URL I found in the site’s code returns nothing: https://www.camara.cl/legislacion/sesiones_sala/tabla.aspx?_=1628291424652

I can’t use RSelenium as I’m working remotely through a headless server.


Solution

  • You need to pick up a session cookie (ASP.NET_SessionId) from the initial URL before requesting the data page. rvest's session() can do this, for example:

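    library(rvest)
    library(magrittr)
    
    # Open the landing page first so the server sets the session cookie,
    # then request the data page within the same session.
    r <- session('https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx') %>% 
      session_jump_to('https://www.camara.cl/legislacion/sesiones_sala/tabla.aspx')
    
    # Parse the response and extract every <table> as a list of tibbles.
    tables <- r %>% read_html() %>% html_table()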
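
If you want to confirm that the cookie was actually set before trusting the result, the same two-step approach can be wrapped in a small function. This is only a sketch under assumptions: the helper names extract_tables() and fetch_session_tables() are hypothetical, it relies on the session object exposing its httr response as s$response, and it assumes the site still behaves as described above.

```r
library(rvest)
library(httr)

# URLs taken from the question and the answer above.
landing_url <- "https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx"
data_url    <- "https://www.camara.cl/legislacion/sesiones_sala/tabla.aspx"

# Hypothetical helper: parse every <table> in a document and drop empty ones.
extract_tables <- function(doc) {
  tables <- html_table(doc)
  Filter(function(tbl) nrow(tbl) > 0, tables)
}

# Hypothetical helper: open the landing page, check for the session cookie,
# then request the data page within the same session.
fetch_session_tables <- function(landing = landing_url, data = data_url) {
  s <- session(landing)

  # Cookie name as given in the answer; warn rather than fail if it is missing.
  cookie_names <- cookies(s$response)$name
  if (!"ASP.NET_SessionId" %in% cookie_names) {
    warning("ASP.NET_SessionId was not set; the data page may come back empty.")
  }

  s <- session_jump_to(s, data)
  stop_for_status(s$response)  # fail loudly on HTTP errors
  extract_tables(read_html(s))
}
```

Calling tables <- fetch_session_tables() should then give a list of tibbles, one per non-empty table on the data page. Nothing here needs a browser, so it should also work on a headless server.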