Tags: r, web-scraping, rvest, http-status-code-504

Many Errors when Web-Scraping with splashr including Error in execute_lua(splash_obj, call_function) : Gateway Timeout (HTTP 504)


I'm trying to use splashr to scrape a dynamic webpage, and it's been non-stop problems for me. During my scrape of get_box_score(), I'll either get the errors

Error in execute_lua(splash_obj, call_function) : 
Gateway Timeout (HTTP 504).

or

Error in UseMethod("html_table") : 
no applicable method for 'html_table' applied to 
an object of class "xml_missing"

And honestly, once I "fix" one of the errors, I get the other. I have no idea if these are related, or if I'm just getting a lot of different unrelated errors with my code. Any idea how I can fix these? Here's my code:

library(tidyverse)
library(splashr)
library(rvest)

url <- "https://www.uscho.com/scoreboard/michigan/mens-hockey/"  

# Everything should be fine for a while
get_data <- function(myurl) {

  link_data <- myurl %>%
    read_html() %>%
    html_nodes("td:nth-child(13) a") %>%
    html_attr("href") %>%
    str_c("https://www.uscho.com", .) %>%
    as_tibble() %>%
    set_names("url")

  game_type <- myurl %>%
    read_html() %>%
    html_nodes("td:nth-child(12)") %>%
    html_text() %>%
    as_tibble() %>%
    set_names("game_type") %>%
    filter(game_type != "Type")

  as_tibble(data.frame(link_data, game_type))

}

link_list <- get_data(url)

urls <- link_list %>%
  filter(game_type != "EX") %>%
  pull(url)

# Here's where the fun starts
get_box_score <- function(my_url) {

  progress_bar$tick()$print()
  Sys.sleep(15)
  splash_container <- start_splash()
  on.exit(stop_splash(splash_container))
  Sys.sleep(10)

  mydata <- splash_local %>%
    splash_response_body(TRUE) %>%
    splash_user_agent(ua_win10_chrome) %>%
    splash_go(my_url) %>%
    splash_wait(runif(1, 5, 10)) %>%
    splash_html() %>%
    html_node("#boxgoals") %>%
    html_table(fill = TRUE) %>%
    as_tibble()

  return(mydata)
}

progress_bar <- link_list %>%
  filter(game_type != "EX") %>%
  tally() %>%
  progress_estimated(min_time = 0)

mydata <- pmap_df(list(urls), get_box_score)

Solution

  • There is nothing wrong with your code. 504 (Gateway Timeout) is a server-side error: the server could not handle the request in time.

    You can still work around it by adjusting your code. Try the following:

    • Slow down your requests. Use Sys.sleep() to pause between requests. If you send them too quickly, the server may not be able to keep up, or it may treat you as a bot and block you; either way, you can end up with a 504 error.

    • Use try() or tryCatch() to skip errors so that they don't break your loop. You can also write code that automatically retries a request that failed.
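
The two suggestions above can be combined into one small helper. This is only a sketch in base R: `with_retry`, `max_tries`, and `pause` are hypothetical names, not part of splashr or rvest, and `flaky` is a stand-in that simulates a request failing twice before it succeeds. Because `tryCatch()` catches any error, it should also cover the `html_table` / `xml_missing` error, which is likely what happens when `html_node("#boxgoals")` finds nothing on a page that did not finish rendering.

```r
# Hypothetical helper: call f(...) up to max_tries times, pausing longer
# after each failure. Returns NULL if every attempt fails.
with_retry <- function(f, ..., max_tries = 3, pause = 5) {
  for (attempt in seq_len(max_tries)) {
    failed <- FALSE
    result <- tryCatch(
      f(...),
      error = function(e) {
        message("Attempt ", attempt, " failed: ", conditionMessage(e))
        failed <<- TRUE   # flag the failure in with_retry's own frame
        NULL
      }
    )
    if (!failed) return(result)
    # Back off a little more after each failed attempt
    if (attempt < max_tries) Sys.sleep(pause * attempt)
  }
  NULL
}

# Stand-in for a scraping call: fails twice, then succeeds.
calls <- 0
flaky <- function(x) {
  calls <<- calls + 1
  if (calls < 3) stop("Gateway Timeout (HTTP 504)")
  x * 2
}

out <- with_retry(flaky, 21, max_tries = 5, pause = 0)

# In the question's loop this might look like (untested against the site):
# results <- lapply(urls, function(u) with_retry(get_box_score, u))
# results <- Filter(Negate(is.null), results)  # drop pages that never loaded
```

Returning NULL instead of stopping means one bad page does not lose the results already collected; the failed URLs can be retried later in a separate pass.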