Way to web-scrape a popular eSport website using R?


I'm attempting to scrape the following URL to obtain live game data: https://egamersworld.com/callofduty/matches. I've inspected the fetch requests the page makes, but no obvious request returns JSON-formatted data with the page info.

Additionally, I'm getting a 403 Forbidden response when attempting to access the site using R. I've also tried replicating the browser's request headers, still with no luck.

I'm no web scraping professional, and I'm curious whether this website requires additional steps on my part, or whether it has measures in place that I'm unaware of.

Here is the R code I've tried. I've tried many different headers and header combinations, all of which result in a 403.

Note: I changed the Accept-Encoding header from "gzip, deflate, br, zstd" to "gzip, deflate" because including "br, zstd" gives the error:

"Error in curl::curl_fetch_memory(url, handle = handle) : Unrecognized content encoding type. libcurl understands deflate, gzip content encodings."

library(httr)

url <- "https://egamersworld.com/callofduty/matches"

headers = add_headers("Accept" = "*/*",
                      "Accept-Encoding" = "gzip, deflate",
                      "Accept-Language" = "en-US,en;q=0.9",
                      "Referer" = "https://egamersworld.com/matches",
                      "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36")

response = GET(url, headers)

response$status_code
# returns 403

Solution

  • Here is an approach that uses selenider to drive a real browser and rvest to pull the data directly from the rendered HTML:

    library(selenider)
    library(rvest)
    
    session <- selenider_session("selenium", browser = "chrome")
    open_url("https://egamersworld.com/callofduty/matches")
    
    elements <- session |> get_page_source() |> html_elements(".item_teams__cKXQT")
    
    res <- data.frame(
      home_team_name = elements |> 
        html_elements(".item_team__evhUQ:nth-child(1) .item_teamName__NSnfH") |> 
        html_text(trim = TRUE),
      home_team_odds = elements |> 
        html_elements(".item_team__evhUQ:nth-child(1) .item_odd__Lm2Wl") |> 
        html_text(trim = TRUE),
      away_team_name = elements |> 
        html_elements(".item_team__evhUQ:nth-child(3) .item_teamName__NSnfH") |> 
        html_text(trim = TRUE),
      away_team_odds = elements |> 
        html_elements(".item_team__evhUQ:nth-child(3) .item_odd__Lm2Wl") |> 
        html_text(trim = TRUE),
      match_date = elements |> 
        html_elements(".item_scores__Vi7YX .item_date__g4cq_") |> 
        html_text(trim = TRUE),
      match_time = elements |> 
        html_elements(".item_scores__Vi7YX .item_time__xBia_") |> 
        html_text(trim = TRUE),
      match_type = elements |> 
        html_elements(".item_scores__Vi7YX .item_bo__u2C9Q") |> 
        html_text(trim = TRUE)
    )
    

    This gives the 20 results available on the page:

    home_team_name       home_team_odds  away_team_name             away_team_odds  match_date  match_time  match_type
    Noctem Esports       1.8             Project 7 Esports          1.8             05.03.25    20:00       Bo5
    Annex Esports        1.8             Team Notorious             1.8             05.03.25    20:00       Bo5
    DIZWRLD              1.8             AVNG Esports               1.8             05.03.25    20:00       Bo5
    Inglorious Gaming    1.8             Notorious Gaming           1.8             05.03.25    22:00       Bo5
    Clutch Rayn Esport   1.8             Katana Gaming              1.8             05.03.25    22:00       Bo5
    Rauzan Esport        1.8             Team Bance                 1.8             05.03.25    22:00       Bo5
    OMiT Brooklyn        1.8             Team WaR                   1.8             06.03.25    00:30       Bo5
    YFP                  1.8             Kansas City Pioneers       1.8             06.03.25    00:30       Bo5
    6F Carolina          1.8             Pinnacle                   1.8             06.03.25    00:30       Bo5
    OMiT Brooklyn        1.8             Destro Gaming              1.8             06.03.25    02:00       Bo5
    Luxury Exotics       2.627           CABAL Gaming               1.454           06.03.25    02:00       Bo5
    Royal Spartans       2               Lore Gaming                1.727           06.03.25    02:00       Bo5
    Vancouver Surge      2.87            Los Angeles Thieves        1.41            07.03.25    21:00       Bo5
    Toronto Ultra        1.03            Vegas Falcons              9.07            07.03.25    22:30       Bo5
    Cloud9 New York      1.8             Los Angeles Guerrillas M8  1.8             08.03.25    00:00       Bo5
    Atlanta FaZe         1.23            Minnesota RØKKR            3.82            08.03.25    21:00       Bo5
    Cloud9 New York      1.89            Carolina Royal Ravens      1.79            08.03.25    22:30       Bo5
    Los Angeles Thieves  1.06            Boston Breach              7               09.03.25    00:00       Bo5
    OpTic Texas          1.85            Miami Heretics             1.85            09.03.25    01:30       Bo5
    Vegas Falcons        1.8             Los Angeles Guerrillas M8  1.8             09.03.25    21:00       Bo5
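
    Everything scraped with html_text() comes back as character strings, so the odds and the date/time usually need converting before analysis. The following is a minimal sketch on two rows copied from the output above. It assumes the dates are in dd.mm.yy format and uses UTC as a placeholder time zone, since the page's actual time zone is not stated here. The match_start column name is my own.

    ```r
    # Two rows copied from the output above; scraped values arrive as character strings
    res <- data.frame(
      home_team_odds = c("1.8", "2.627"),
      away_team_odds = c("1.8", "1.454"),
      match_date     = c("05.03.25", "06.03.25"),
      match_time     = c("20:00", "02:00")
    )

    # Convert the odds columns from character to numeric
    res$home_team_odds <- as.numeric(res$home_team_odds)
    res$away_team_odds <- as.numeric(res$away_team_odds)

    # Combine date and time into one timestamp.
    # Assumptions: dd.mm.yy date format, and UTC as a placeholder time zone.
    res$match_start <- as.POSIXct(
      paste(res$match_date, res$match_time),
      format = "%d.%m.%y %H:%M",
      tz = "UTC"
    )

    str(res)
    ```

    If the site shows times in your local time zone, replace "UTC" with the matching zone name (see OlsonNames()).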

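    In case any of these class names change (suffixes such as __cKXQT look auto-generated), you can also try the following, which relies only on the container class and splits each match's text into columns: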

    library(selenider)        
    session <- selenider_session("selenium", browser = "chrome")
    open_url("https://egamersworld.com/callofduty/matches")
    
    elements <- session |>
      find_elements(".item_teams__cKXQT") |>
      as.list()
    
    res <- do.call(rbind, lapply(elements, function(x) {
      matrix(strsplit(elem_text(x), "\n")[[1]], nrow = 1)
    })) |> as.data.frame()
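
    One caveat with this approach: rbind() fails if the matches do not all split into the same number of lines, for example when one match has no odds listed. A minimal sketch of a workaround is to pad every row to the longest length first. The texts below are hypothetical stand-ins for what elem_text() might return, and the real line order depends on the page layout.

    ```r
    # Hypothetical element texts; the real content and line order depend on the page
    texts <- c(
      "Noctem Esports\n1.8\n05.03.25\n20:00\nBo5\nProject 7 Esports\n1.8",
      "Team A\n05.03.25\n22:00\nBo5\nTeam B"  # a match with fewer lines
    )

    # Split each text into its lines
    parts <- strsplit(texts, "\n", fixed = TRUE)

    # Pad shorter rows with NA on the right so every row has the same length
    width  <- max(lengths(parts))
    padded <- lapply(parts, function(p) c(p, rep(NA_character_, width - length(p))))

    # Bind the padded rows into a data frame (columns are named V1, V2, ...)
    res <- as.data.frame(do.call(rbind, padded))
    print(res)
    ```

    Padding only keeps rbind() from failing. It does not realign the columns, so rows containing NA values still need to be inspected by hand.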
    

    Using the browser tools (F12), you only need the class name of the match container (.item_teams__cKXQT in the code above).


    Listening to the Websocket

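    If you want, you can fetch real-time match data by listening to the site's WebSocket. The code below collects every incoming message and writes them all to a JSON file when the connection closes: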

    library(websocket)
    library(jsonlite)
    
    ws <- websocket::WebSocket$new("wss://ws.egamersworld.com/socket.io/?EIO=3&transport=websocket", autoConnect = FALSE)
    
    all_messages <- list()
    
    ws$onOpen(function(event) {  cat("Connection opened\n")})
    ws$onError(function(event) {  cat("Error occurred:\n") ; print(event) })
    
    ws$onMessage(function(event) {
      cat("Message received\n")
      all_messages <<- c(all_messages, list(event$data))
      cat("Messages collected:", length(all_messages), "\n")
    })
    
    ws$onClose(function(event) {
      cat("Connection closed. Saving data to json...\n")
      
      output_file <- paste0("egamersworld_data_", format(Sys.time(), "%Y%m%d_%H%M%S"), ".json")
      writeLines(toJSON(all_messages, auto_unbox = TRUE), output_file)
      
      cat("All messages saved to:", output_file, "\n")
    })
    
    
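    ws$connect() # Connect and start listening

    # Let the connection run until enough messages have arrived, then close it.
    # Calling close() immediately after connect() would end the session before
    # any data comes in. Closing triggers the onClose handler, which saves the data.
    ws$close()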
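
    The collected messages are raw Socket.IO frames rather than plain JSON. With EIO=3, an event frame is the prefix 42 followed by a JSON array of the form [event_name, payload], and frames such as 0{...} or 40 are handshake and connect packets. The following is a minimal sketch of extracting the events. The sample frames and the match_update event name are hypothetical, and parse_event is my own helper, so check the messages you actually receive.

    ```r
    library(jsonlite)

    # Hypothetical frames in the Engine.IO v3 / Socket.IO text format
    all_messages <- list(
      '0{"sid":"abc","pingInterval":25000,"pingTimeout":5000}',  # handshake
      '40',                                                       # connect
      '42["match_update",{"id":1,"score":"2-1"}]'                 # event
    )

    # Hypothetical helper: return list(event, data) for event frames, NULL otherwise
    parse_event <- function(msg) {
      if (!is.character(msg) || !startsWith(msg, "42")) return(NULL)
      # Drop the "42" prefix and parse the remaining JSON array
      payload <- fromJSON(substring(msg, 3), simplifyVector = FALSE)
      list(
        event = payload[[1]],
        data  = if (length(payload) > 1) payload[[2]] else NULL
      )
    }

    # Keep only the frames that parsed as events
    events <- Filter(Negate(is.null), lapply(all_messages, parse_event))
    str(events)
    ```

    This sketch ignores frames that carry a namespace or an acknowledgement id after the 42 prefix. Those would need extra handling if the site uses them.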