
How to automate page number change in web scraping?


My problem is that I have to change the "pages_to_scrape" integer vector by hand each time I scrape a different part of a given static site, because I don't know the number of pages in advance. I want to automate this: when I don't know the page count beforehand, I want to be able to scrape all available pages.

library(rvest)
library(tidyverse)  # map_dfr(), tibble(), mutate(), separate_wider_delim(), parse_number(), str_remove()

base_url <- "https://morskidar.bg/products.php?category=pryasna-riba&page=%d"

# Scrape product names and prices from a single listing page
scrape_prices <- function(page) {
  url <- sprintf(base_url, page)
  page_content <- read_html(url)
  pc <- page_content %>%
    html_elements(".col-sm-6") %>%
    # One row per product tile
    map_dfr(~ tibble(
      product = .x %>%
        html_element(".shop-three-products-name a") %>%
        html_text2(),
      price = .x %>%
        html_element(".shop-three-products-price") %>%
        html_text2()
    )) %>%
    mutate(date = Sys.Date(),
           location = "Unknown",
           type = "Unknown",
           source = "Unknown", .before = product) %>%
    # Price text has the form "<unit> - <price>"
    separate_wider_delim(price, delim = " - ", names = c("unit", "price")) %>%
    mutate(price = parse_number(price), unit = str_remove(unit, "\\.")) %>%
    distinct()
  return(pc)
}

# Currently set by hand for each category
pages_to_scrape <- 1:5
final_df <- map_dfr(pages_to_scrape, scrape_prices)

Solution

  • As the pagination links (and with them the page numbers) are contained in the same HTML document:

    <ul class="pagination">
      <li class="active"><a href="/products.php?category=pryasna-riba&amp;page=1" title="Pagination">1</a></li>
      <li><a href="/products.php?category=pryasna-riba&amp;page=2" title="Pagination">2</a></li>
      <li><a href="/products.php?category=pryasna-riba&amp;page=3" title="Pagination">3</a></li>
      <li><a href="/products.php?category=pryasna-riba&amp;page=4" title="Pagination">4</a></li>
      <li><a href="/products.php?category=pryasna-riba&amp;page=5" title="Pagination">5</a></li>
    </ul>
    

    , you can first scrape the values for your pages_to_scrape vector:

    library(rvest)
    base_url <- "https://morskidar.bg/products.php?category=pryasna-riba&page=%d"
    pages_to_scrape <- 
      sprintf(base_url, 1) |>
      read_html() |>
      html_elements("ul.pagination li") |> 
      html_text() |>
      as.integer()
    
    pages_to_scrape
    #> [1] 1 2 3 4 5
    

    Created on 2024-01-04 with reprex v2.0.2
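
A possible variation, offered as a sketch rather than a tested solution for the live site: read the page numbers from the `page=` parameter of each link's `href` instead of from the link text. This may hold up better if a pagination list also contains non-numeric entries such as a "next" arrow, where `as.integer()` on the text would give `NA`. The helper name `extract_page_numbers()` and the extra "»" list item are assumptions added for illustration; the markup shown above has no such item. The example parses an inline copy of the pagination HTML so it runs without network access.

```r
library(rvest)

# Pagination markup copied from the answer above, plus one HYPOTHETICAL
# "next" arrow item (not present in the original markup) to show the
# non-numeric case. On the live site the document would come from
# read_html(sprintf(base_url, 1)) instead.
pagination_doc <- minimal_html('
  <ul class="pagination">
    <li class="active"><a href="/products.php?category=pryasna-riba&amp;page=1" title="Pagination">1</a></li>
    <li><a href="/products.php?category=pryasna-riba&amp;page=2" title="Pagination">2</a></li>
    <li><a href="/products.php?category=pryasna-riba&amp;page=3" title="Pagination">3</a></li>
    <li><a href="/products.php?category=pryasna-riba&amp;page=4" title="Pagination">4</a></li>
    <li><a href="/products.php?category=pryasna-riba&amp;page=5" title="Pagination">5</a></li>
    <li><a href="/products.php?category=pryasna-riba&amp;page=2" title="Pagination">&raquo;</a></li>
  </ul>
')

# Hypothetical helper: collect the distinct page numbers found in the
# href attributes of the pagination links.
extract_page_numbers <- function(doc) {
  hrefs <- doc |>
    html_elements("ul.pagination li a") |>
    html_attr("href")
  # Keep only links that actually carry a numeric page parameter
  hrefs <- hrefs[!is.na(hrefs) & grepl("[?&]page=[0-9]+", hrefs)]
  pages <- as.integer(sub(".*[?&]page=([0-9]+).*", "\\1", hrefs))
  sort(unique(pages))
}

pages_to_scrape <- extract_page_numbers(pagination_doc)
pages_to_scrape
```

With the question's `scrape_prices()` defined, the resulting vector can be passed on unchanged: `final_df <- map_dfr(pages_to_scrape, scrape_prices)`. Note that this only finds pages that are linked from the first page; if a site abbreviates long pagination lists, something like `seq_len(max(pages_to_scrape))` may be needed instead.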