Looping over multiple pages with RSelenium


I've managed to get RSelenium to work and so far I've been successful in extracting some data from the following website: https://www.immobiliare.it/vendita-case/belluno-provincia/?criterio=rilevanza

However, my problem is that I can't get Selenium to move on to the next page.

library(RSelenium)
library(netstat)

rs_driver_obj <- rsDriver(browser = "chrome",
                          chromever = "126.0.6478.126",
                          verbose = FALSE,
                          port = free_port())

remDr <- rs_driver_obj$client
remDr$open()
remDr$navigate("https://www.immobiliare.it/vendita-case/belluno-provincia/?criterio=rilevanza")

remDr$executeScript("window.scrollTo(0,document.body.scrollHeight);")
next_button <- remDr$findElement(using = "xpath", "//a[starts-with(@class,'in-pagination__item nd-button nd-button--ghost')]")
next_button$clickElement()

I get an error reporting a StaleElementReference. How would I go about looping over all the result pages for the URL provided above?


Solution

  • Try this. I'm using a remote driver to connect to an already running Selenium server, but you can replace the call to remoteDriver() with one to rsDriver() and use its $client element as the driver.

    library(RSelenium)
    
    URL <- "https://www.immobiliare.it/vendita-case/belluno-provincia/?criterio=rilevanza"
    
    driver <- remoteDriver(browserName = "chrome", port = 4444)
    
    driver$open()
    
    driver$navigate(URL)
    
    # Give the page (and the cookie banner) time to load.
    #
    Sys.sleep(5)
    
    # Accept cookies & conditions.
    #
    accept <- tryCatch({
      suppressMessages(
        driver$findElement(using = "css selector", "#didomi-notice-agree-button")
      )
    }, error = function(e) {
      NULL
    })
    if (!is.null(accept)) accept$clickElement()
    
    while (TRUE) {
      # Scroll to bottom of main section.
      #
      main <- driver$findElement(using = "css selector", "main section")
      driver$executeScript("arguments[0].scrollTop = arguments[0].scrollHeight;", list(main))
      
      # Find link to next page.
      #
      paginate <- tryCatch({
        suppressMessages(
          driver$findElement(using = "css selector", "[data-cy='pagination-next'] a:first-child")
        )
      }, error = function(e) {
        NULL
      })
      if (is.null(paginate)) break
      URL <- paginate$getElementAttribute("href")[[1]]
      
      cat("Next page: ", URL, ".\n", sep = "")
      
      # Pause before advancing to next page.
      #
      Sys.sleep(15)
      driver$navigate(URL)
    }
    
    driver$close()
    

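    The crux is that the scroll bar is attached to an element on the page (the main section) rather than to the page itself, so the code scrolls that element instead of the window.

    Also, rather than trying to click the link for the next page, just extract its URL and navigate to it. The link is looked up afresh on every page, so no element reference is carried over from one page to the next.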
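
If you also want to collect something from each page, the loop above can be wrapped in a function. The following is only a sketch: crawl_pages(), scrape_page, max_pages and pause are names made up for illustration, and it assumes the driver is already on the first results page with the cookie banner dismissed. The selectors are the ones used above and may stop working if the site changes.

```r
# Sketch only: crawl_pages() and its arguments are hypothetical helpers,
# not part of RSelenium. `driver` is assumed to be an open remoteDriver
# that is already showing the first results page.
crawl_pages <- function(driver, scrape_page,
                        next_selector = "[data-cy='pagination-next'] a:first-child",
                        max_pages = 50, pause = 15) {
  results <- list()
  for (i in seq_len(max_pages)) {
    # Collect whatever the caller wants from the current page.
    results <- c(results, list(scrape_page(driver)))

    # Scroll the scrollable element (not the window) to the bottom.
    main <- driver$findElement(using = "css selector", "main section")
    driver$executeScript("arguments[0].scrollTop = arguments[0].scrollHeight;", list(main))

    # Look the link up afresh on every page; NULL means there is no next page.
    nxt <- tryCatch(
      suppressMessages(driver$findElement(using = "css selector", next_selector)),
      error = function(e) NULL
    )
    if (is.null(nxt) || i == max_pages) break

    # Navigate by URL instead of clicking the link.
    url <- nxt$getElementAttribute("href")[[1]]
    Sys.sleep(pause)
    driver$navigate(url)
  }
  results
}
```

For example, crawl_pages(driver, function(d) d$getCurrentUrl()[[1]]) should return the URL of every page visited. In practice scrape_page would call findElements() to pull out the listings. The max_pages cap is only there as a guard against an endless loop.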