Tags: r, web-scraping, rvest, jsonlite, chromote

Parsing issues when scraping


I'm having trouble with the code below. The function test fetches data from a website and works well when called on its own for any single value of i from 2 to 33000. But when I loop over all the pages, I get parsing errors and multiple identical rows in my data frame.

library(rvest)
library(chromote)
library(jsonlite)
library(dplyr)


test=function(i){
  b <- ChromoteSession$new()
  p=b$Page$loadEventFired(wait_ = FALSE)
  b$Page$navigate(paste("https://www.ecologie.gouv.fr/sru_api/api/towns/",i,sep=""),wait_ = FALSE)
  b$wait_for(p)
  html <- b$Runtime$evaluate('document.documentElement.outerHTML')
  content <- read_html(html$result$value)
  data_json=html_text(content)
  df=fromJSON(data_json)
  return(df)}



ma_liste <- list()
n=100
for (i in 2:n){
  tryCatch({
    ma_liste <- c(ma_liste, list(test(i)))
  })
}

ma_liste
dataframe <- do.call(rbind, ma_liste)
dataframe <- as.data.frame(dataframe)

I tried to ignore the problematic rows with tryCatch, but that doesn't fix the duplicate rows (and it skips a lot of data). Can you help me with this? Thanks.


Solution

  • The problem persists on my own computer. Since the connection was the cause, I made the loop retry every failed iteration using tryCatch, and it now works fine for me. I conclude that the problem comes from my proxy/firewall, or something else independent of the code anyone could provide me with. Execution speed remains an issue, but that matters less to me.

    library(rvest)
    library(chromote)
    library(jsonlite)
    library(dplyr)
    library(progress)
    
    test <- function(i) {
      b <- ChromoteSession$new()
      # Close the session even if a later step fails, so retries don't leak sessions
      on.exit(b$close(), add = TRUE)
      # Register the load event before navigating, then wait for it
      p <- b$Page$loadEventFired(wait_ = FALSE)
      b$Page$navigate(paste0("https://www.ecologie.gouv.fr/sru_api/api/towns/", i), wait_ = FALSE)
      b$wait_for(p)
      html <- b$Runtime$evaluate('document.documentElement.outerHTML')
      content <- read_html(html$result$value)
      # The page body is the raw JSON text
      data_json <- html_text(content)
      df <- fromJSON(data_json)
      return(df)
    }
    
    start.time <- Sys.time()
    ma_liste <- list()
    n <- 100
    pb <- progress_bar$new(total = n - 1)  # the loop runs over 2:n, i.e. n - 1 iterations
    for (i in 2:n) {
      pb$tick()
      retry <- TRUE
      while (retry) {
        tryCatch({
          ma_liste <- c(ma_liste, list(test(i)))
          retry <- FALSE  # No error, so no need to retry
        }, error = function(e) {
          message("Error for i = ", i, ": ", conditionMessage(e))
          Sys.sleep(0.001)  # Wait briefly before retrying
        })
      }
    }
    
    dataframe <- do.call(rbind, ma_liste)
    dataframe <- as.data.frame(dataframe)
    end.time <- Sys.time()
    time.taken <- round(end.time - start.time,2)
    time.taken
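
One caveat with the loop above: `while (retry)` never gives up, so an id that fails every time would keep the script spinning forever. Below is a minimal sketch of a bounded retry in base R. The names `with_retry`, `max_tries`, `wait` and the `flaky` stand-in are hypothetical and not part of the code above. `flaky` only simulates connection failures so the example runs without Chrome or network access; in practice you would pass `test` instead.

```r
# Hypothetical helper: call f(i) at most max_tries times, pausing between attempts.
# Returns NULL if every attempt fails (so f itself should not return NULL on success).
with_retry <- function(f, i, max_tries = 5, wait = 0.5) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      f(i),
      error = function(e) {
        message("Attempt ", attempt, " failed for i = ", i, ": ", conditionMessage(e))
        NULL
      }
    )
    if (!is.null(result)) return(result)
    # Wait a little longer after each failure before trying again
    if (attempt < max_tries) Sys.sleep(wait * attempt)
  }
  NULL  # give up on this id
}

# Stand-in for test(): fails twice out of every three calls (simulated, no network)
flaky <- local({
  calls <- 0
  function(i) {
    calls <<- calls + 1
    if (calls %% 3 != 0) stop("simulated connection failure")
    data.frame(id = i)
  }
})

ids <- 2:4
# lapply builds the list in one pass instead of growing it with c() in a loop
results <- lapply(ids, function(i) with_retry(flaky, i, wait = 0))
failed <- ids[vapply(results, is.null, logical(1))]  # ids that never succeeded
dataframe <- do.call(rbind, Filter(Negate(is.null), results))
```

Keeping the ids that never succeeded in `failed` makes it possible to rerun only those later instead of silently losing them.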