Search code examples
rloopsreturnrvest

R : Returning a value from loop when manually stopping it


I am trying to build a database with rvest. Since I have much data to download, I tried to write several functions that would allow me to interrupt the scraping process and to restart it where I left it. However, while the functions work more or less, whenever I manually interrupt them, I loose the output. Does anyone know a solution that would allow me to stop the function without loosing the dataframe that the loop is building ? I would be glad for any advice!

Some urls that I am trying to scrape data from:

to_do <- c("https://jobs.51job.com/shenzhen-nsq/116924235.html?s=01&t=0",
          "https://jobs.51job.com/shenzhen-nsq/116923692.html?s=01&t=0",
          "https://jobs.51job.com/shenzhen-nsq/116923628.html?s=01&t=0",
          "https://jobs.51job.com/shenzhen-nsq/116923578.html?s=01&t=0",
          "https://jobs.51job.com/shenzhen-nsq/116920896.html?s=01&t=0")

The functions I created for downloading:

# In order to initiate the dowload
dl_data_start <- function(to_do){
  output <- tibble()
  i = 1
  while (to_do[i] %in% to_do) {
      page <- read_html(to_do[i])
      position <- page %>%
        html_nodes(.,'h1') %>%
        html_text(.)
      resume <- page %>%
        html_nodes(.,'.ltype') %>%
        html_text(.)
      job_offer <- page %>%
        html_nodes(.,'.job_msg') %>%
        html_text(.)
      eps <- page %>%
        html_nodes(.,'.com_msg') %>%
        html_text(.)
      eps_status <- page %>%
        html_nodes(.,'.at:nth-child(1)') %>%
        html_text(.)
      eps_description <- page %>%
        html_nodes(.,'.tmsg') %>%
        html_text(.)
      employees <- page %>%
        html_nodes(.,'.at:nth-child(2)') %>%
        html_text(.)
      category <- page %>%
        html_nodes(.,'.at:nth-child(3)') %>%
        html_text(.)
      salary <- page %>%
        html_nodes(.,'.cn strong') %>%
        html_text(.)
      url <- to_do[i]
      id <- i
      current <- tibble(position,resume,job_offer,eps,eps_description,eps_status,
                        employees,category,salary,url,id)
      output <- bind_rows(output,current)
      print(output[i,])
      i = i + 1
  }
  return(output)
}
# function in order to continue the download where I left it
dl_data_continue <- function(to_do,df,done){
  i = (match(tail(done,n=1),to_do) + 1)
  while (to_do[i] %in% to_do) {
    page <- read_html(to_do[i])
    position <- page %>%
      html_nodes(.,'h1') %>%
      html_text(.)
    resume <- page %>%
      html_nodes(.,'.ltype') %>%
      html_text(.)
    job_offer <- page %>%
      html_nodes(.,'.job_msg') %>%
      html_text(.)
    eps <- page %>%
      html_nodes(.,'.com_msg') %>%
      html_text(.)
    eps_status <- page %>%
      html_nodes(.,'.at:nth-child(1)') %>%
      html_text(.)
    eps_description <- page %>%
      html_nodes(.,'.tmsg') %>%
      html_text(.)
    employees <- page %>%
      html_nodes(.,'.at:nth-child(2)') %>%
      html_text(.)
    category <- page %>%
      html_nodes(.,'.at:nth-child(3)') %>%
      html_text(.)
    salary <- page %>%
      html_nodes(.,'.cn strong') %>%
      html_text(.)
    url <- to_do[i]
    id <- i
    current <- tibble(position,resume,job_offer,eps,eps_description,eps_status,
                      employees,category,salary,url,id)
    df <- bind_rows(df,current)
    print(df[i,])
    i = i + 1
  }
  return(df)
}

The issue I have is that whenever an I interrupt the loop or when an error occurs, I loose all the data. Could anybody solve this problem ? I tried several things such as safely or tryCatch but I can't get my head around what's wrong here. Thank you very much.

Edit : I also made some attemps with tryCatch. Using the below function, the code no longer breaks whenever it encounters a problem (e.g. HTTP 404 error). However when there's an error, the loop will remain stuck in the problematic iteration, so I must use it wrong.

dl_data_continue_2 <- function(to_do,df,done){
  i = (match(tail(done,n=1),to_do) + 1)
  while (to_do[i] %in% to_do) {
    tryCatch(
      {expr =
        page <- read_html(to_do[i])
      position <- page %>%
        html_nodes(.,'h1') %>%
        html_text(.)
      resume <- page %>%
        html_nodes(.,'.ltype') %>%
        html_text(.)
      job_offer <- page %>%
        html_nodes(.,'.job_msg') %>%
        html_text(.)
      eps <- page %>%
        html_nodes(.,'.com_msg') %>%
        html_text(.)
      eps_status <- page %>%
        html_nodes(.,'.at:nth-child(1)') %>%
        html_text(.)
      eps_description <- page %>%
        html_nodes(.,'.tmsg') %>%
        html_text(.)
      employees <- page %>%
        html_nodes(.,'.at:nth-child(2)') %>%
        html_text(.)
      category <- page %>%
        html_nodes(.,'.at:nth-child(3)') %>%
        html_text(.)
      salary <- page %>%
        html_nodes(.,'.cn strong') %>%
        html_text(.)
      url <- to_do[i]
      id <- i
      current <- tibble(position,resume,job_offer,eps,eps_description,eps_status,
                        employees,category,salary,url,id)
      df <- bind_rows(df,current)
      print(df[i,])
      i = i + 1},
      error = function(e){
        message("* Caught an error on itertion ")
        print(e)
        i = i + 1
      }
    )
  }
  out
}

Using safely, I basically tried

library(purrr)
dl_safely <- safely(dl_data_continue)

Solution

  • I come across this problem often in webscraping. The key is to store the intermediate results in an environment where they are accessible if your function throws an error. The obvious place is the global environment, but this depends on how you are using your function. If it is part of a package, then you don't want to write to the global workspace. In that case you can have a "storage" environment as part of the package.

    Perhaps the neatest way to do this is to delete the intermediate object after the loop is complete, so it will only ever be visible / accessible if the loop throws an error.

    Here is a function that demonstrates the principle:

    write_data_frames <- function(n)
    {
      if(!exists("temporary", .GlobalEnv))
      {
        assign("temporary", list(), envir = globalenv())
        i <- 1
      }
      else
      {
        i <- length(.GlobalEnv$temporary) + 1
      }
    
      while(i <= n)
      {
        # This is the block where you do your web scraping and store the result
        .GlobalEnv$temporary[[i]] <- data.frame(var1 = rnorm(1), var2 = runif(1))
    
        # We'll create an error when i == 4
        if(i == 4) stop("Something broke!")
        i <- i + 1
      }
      result <- do.call(rbind, temporary)
      rm("temporary", envir = globalenv())
      return(result)
    }
    

    Now, this should return a nice data frame if I ask it for 3 rows:

    write_data_frames(3)
    #>         var1      var2
    #> 1 -1.6428100 0.1976913
    #> 2  0.7136643 0.9684348
    #> 3 -0.4845004 0.0294557
    

    And it hasn't left anything in our global workspace:

    ls()
    #> [1] "write_data_frames"
    

    But suppose I ask for ten rows: here, it will throw an error on the fourth loop:

    write_data_frames(10)
    #> Error in write_data_frames(10) : Something broke!
    

    However, this time, the object temporary is available to me:

    ls()
    #> [1] "temporary"         "write_data_frames"
    
    temporary
    #> [[1]]
    #>       var1      var2
    #> 1 -1.46648 0.1748874
    #> 
    #> [[2]]
    #>          var1      var2
    #> 1 -0.03855686 0.5772731
    #> 
    #> [[3]]
    #>        var1      var2
    #> 1 0.8228591 0.4115181
    #> 
    #> [[4]]
    #>        var1      var2
    #> 1 0.9183934 0.2732575
    

    Even better, my function is designed to just carry on from where it left off, so if I once again do

    write_data_frames(10)
    #>           var1      var2
    #> 1  -1.46647987 0.1748874
    #> 2  -0.03855686 0.5772731
    #> 3   0.82285907 0.4115181
    #> 4   0.91839339 0.2732575
    #> 5   0.54850658 0.9946303
    #> 6  -1.39917426 0.9948544
    #> 7   0.39525152 0.9234611
    #> 8  -1.05899076 0.6226182
    #> 9  -2.03137464 0.1218762
    #> 10  0.24880216 0.6631982
    

    The function has started again from position 5 without any modification. And now, when we check our global workspace, there's nothing left:

    ls()
    #> [1] "write_data_frames"