My problem is that I have to change the "pages_to_scrape" integer vector by hand each time I scrape a different part of a given static site for which I don't know the number of pages. I want to automate this: when I don't know the page count beforehand, I want to be able to scrape all available pages.
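library(rvest)
library(tidyverse)  # map_dfr(), tibble(), mutate(), separate_wider_delim(), parse_number(), str_remove()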
base_url <- "https://morskidar.bg/products.php?category=pryasna-riba&page=%d"
scrape_prices <- function(page) {
  url <- sprintf(base_url, page)
  page_content <- read_html(url)
  pc <- page_content %>%
    html_elements(".col-sm-6") %>%
    map_dfr(~ tibble(
      product = .x %>%
        html_element(".shop-three-products-name a") %>%
        html_text2(),
      price = .x %>%
        html_element(".shop-three-products-price") %>%
        html_text2(),
    )) %>%
    mutate(date = Sys.Date(),
           location = "Unknown",
           type = "Unknown",
           source = "Unknown", .before = product) %>%
    separate_wider_delim(price, delim = " - ", names = c("unit", "price")) %>%
    mutate(price = parse_number(price), unit = str_remove(unit, "\\.")) %>%
    distinct()
  return(pc)
}
pages_to_scrape <- 1:5
final_df <- map_dfr(pages_to_scrape, scrape_prices)
As the pagination links (and with them the page numbers) are contained in the same HTML document:
<ul class="pagination">
<li class="active"><a href="/products.php?category=pryasna-riba&page=1" title="Pagination">1</a></li>
<li><a href="/products.php?category=pryasna-riba&page=2" title="Pagination">2</a></li>
<li><a href="/products.php?category=pryasna-riba&page=3" title="Pagination">3</a></li>
<li><a href="/products.php?category=pryasna-riba&page=4" title="Pagination">4</a></li>
<li><a href="/products.php?category=pryasna-riba&page=5" title="Pagination">5</a></li>
</ul>
, you can first scrape the values for your pages_to_scrape vector:
library(rvest)
base_url <- "https://morskidar.bg/products.php?category=pryasna-riba&page=%d"
pages_to_scrape <-
sprintf(base_url, 1) |>
read_html() |>
html_elements("ul.pagination li") |>
html_text() |>
as.integer()
pages_to_scrape
#> [1] 1 2 3 4 5
Created on 2024-01-04 with reprex v2.0.2
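The html_text() |> as.integer() pipeline above works when every pagination item is a plain number. If a category also had non-numeric items (a "Next" link, say) or skipped some page numbers, a slightly more defensive variant might look like the sketch below. Note that get_page_numbers() is a hypothetical helper, not part of rvest, and the demo markup is only modelled on the snippet above: the "Next" item and the gap between pages 2 and 5 are assumptions, not something observed on the site. The sketch also assumes pages are numbered consecutively from 1.

```r
library(rvest)

# Hypothetical helper (not part of rvest): returns 1..max page number found in
# a pagination list, ignoring non-numeric items such as "Next".
get_page_numbers <- function(doc, css = "ul.pagination li a") {
  labels <- doc |>
    html_elements(css) |>
    html_text2()
  # Keep only labels made up entirely of digits
  nums <- as.integer(labels[grepl("^[0-9]+$", labels)])
  if (length(nums) == 0) {
    return(1L)  # assumption: no pagination list means a single page
  }
  # Assumption: pages are numbered consecutively starting at 1
  seq_len(max(nums))
}

# Offline demo; the "Next" item and the missing pages 3-4 are assumed
demo <- minimal_html('
  <ul class="pagination">
    <li class="active"><a href="?page=1">1</a></li>
    <li><a href="?page=2">2</a></li>
    <li><a href="?page=5">5</a></li>
    <li><a href="?page=2">Next</a></li>
  </ul>')

pages_to_scrape <- get_page_numbers(demo)
pages_to_scrape
#> [1] 1 2 3 4 5
```

Against the live site, the same helper could presumably be fed with sprintf(base_url, 1) |> read_html(), and the result passed to map_dfr(pages_to_scrape, scrape_prices) exactly as in the question.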