Tags: r, for-loop, web-scraping, rvest

Received Error in open.connection(x, "rb") : HTTP error 404 after running a for-loop in R


While trying to scrape information from several links, I got the error: Error in open.connection(x, "rb") : HTTP error 404.

I suspect it has something to do with the first part of my for-loop, so I tried converting numbers from character to numeric, but that did not fix the problem. I also tried the advice here; however, it caused more problems.

Think you can spot where I went wrong?

library(rvest)
library(tidyverse)

pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')
get_links <- pageMen %>% 
  html_nodes('.categories-grid__category a') %>% 
  html_attr('href') %>%
  paste0('https://www.bjjcompsystem.com', .) 

# extract numerical part of link
numbers = str_sub(get_links, - 7, - 1)  
numbers = as.numeric(numbers)

## create empty vector  ----------------------------
master1.tree = data.frame()

## Create for loop ---------------------------------
for (i in length(numbers)){
  url <- read_html(paste0('https://www.bjjcompsystem.com/tournaments/1869/categories/', i))
  
ageDivision <- url %>% html_nodes('.category-title__age-division') %>% html_text()

gender <- url %>% html_nodes('.category-title__age-division+ .category-title__label') %>% html_text()  

matches = data.frame('division' = ageDivision,'gender' = gender)
master1.tree <- rbind(master1.tree, data.frame(matches))
}

I also ran this, but it did not return a data frame of the scraped data. Instead, it printed the results to the screen:

map_df(get_links, function(i){
  url <- read_html(i)
  
matches <- data.frame(ageDivision <- url %>% 
  html_nodes('.category-title__age-division') %>% html_text(),
gender <- url %>% html_nodes('.category-title__age-division+ .category-title__label') %>% html_text() ) 

master1.tree <- rbind(master1.tree, matches)
})

Solution

  • Here is an alternative to your code. First, there is no need to extract the numbers: you can loop directly over the vector get_links. Second, I use purrr::map_df for the looping part, which is more concise than a for loop. To this end, I use a custom function that scrapes a single page. Finally, I pass trim = TRUE to html_text to remove leading and trailing whitespace:

    library(rvest)
    library(tidyverse)
    
    pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')
    
    get_links <- pageMen %>% 
      html_nodes('.categories-grid__category a') %>% 
      html_attr('href') %>%
      paste0('https://www.bjjcompsystem.com', .)
    
    scrape_page <- function(url) {
      html <- read_html(url)
      data.frame(
        division = html %>% html_nodes('.category-title__age-division') %>% html_text(trim = TRUE),
        gender = html %>% html_nodes('.category-title__age-division+ .category-title__label') %>% html_text(trim = TRUE)
      )
    }
    
    # Only the first five links are scraped here; pass get_links to scrape all of them
    master1.tree <- purrr::map_df(get_links[1:5], scrape_page)
    
    master1.tree
    #>   division gender
    #> 1 Master 1   Male
    #> 2 Master 1   Male
    #> 3 Master 1   Male
    #> 4 Master 1   Male
    #> 5 Master 1   Male
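
As for why the original loop probably returned a 404: for (i in length(numbers)) iterates over the single value length(numbers) rather than over 1, 2, ..., and the URL is then built from the index i instead of numbers[i]. The request therefore goes to a category page named after the number of links, which most likely does not exist. The sketch below reproduces this without touching the network; the IDs in numbers are hypothetical stand-ins, not values taken from the site.

```r
# Hypothetical category IDs (assumed for illustration; not real values from the site)
numbers <- c(2021001, 2021002, 2021003)
base_url <- "https://www.bjjcompsystem.com/tournaments/1869/categories/"

# Original pattern: length(numbers) is the single value 3,
# so the body runs once with i = 3 and the URL ends in "/3".
visited_original <- character(0)
for (i in length(numbers)) {
  visited_original <- c(visited_original, paste0(base_url, i))
}

# Corrected pattern: seq_along() yields 1, 2, 3,
# and numbers[i] supplies the actual ID for each URL.
visited_fixed <- character(0)
for (i in seq_along(numbers)) {
  visited_fixed <- c(visited_fixed, paste0(base_url, numbers[i]))
}

print(visited_original)
print(visited_fixed)
```

The map_df attempt in the question printed its output for a different reason: map_df returns the combined data frame, and because that return value was never assigned, R auto-printed it. The rbind inside the function only modified a local copy of master1.tree, which is why the solution above assigns the result of map_df directly.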