r web-scraping rvest rselenium dynamic-loading

R: Scraping a dynamically loading page with long but finite scrolling until the end (with RSelenium?)

This page almost seems infinite, as it shows over 6,000 profiles that load dynamically as you scroll.

A similar one only shows 310 profiles, so scrolling to its end does not take as long.

Is there a way to write a single script that could scrape both pages by scrolling to the end?

For a similar purpose, I used RSelenium code like this:

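journal_url <- "https://www.frontiersin.org/journals/photonics#editorial-board"

rD <- RSelenium::rsDriver(browser = "chrome", port = 4546L, verbose = FALSE, chromever = "87.0.4280.20")
remDr <- rD$client            # client object used to drive the browser
remDr$navigate(journal_url)   # open the page before scrolling

# scroll a fixed number of times, pausing so new profiles can load
for (i in 1:5) {
    remDr$executeScript(paste("scroll(0,", i * 10000, ");"))
    Sys.sleep(3)
}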

In the present case, though, scrolling five times, as in for(i in 1:5), may suffice for the second page (with 350 profiles) but will not be enough for the first one (with 6,000 profiles). If someone could point me to code that handles pages of varying sizes, I would be very grateful!

Solution

I believe you'll find your answer here,

https://github.com/yusuzech/r-web-scraping-cheat-sheet#223-simulating-scrolls-clicks-text-inputs-logins-and-other-actions

under the subheader "Scroll Down Until the End(Not Recommended if There Are too Many Pages)."

Edit: Here's the suggested code from the link.

# `driver` is the RSelenium client (called `remDr` in the question)
element <- driver$findElement("css", "body")
flag <- TRUE
counter <- 0
n <- 5
while(flag){
    counter <- counter + 1
    # compare the page source only after every n (n = 5) page-downs, since a single scroll sometimes doesn't render new content
    for(i in 1:n){
        element$sendKeysToElement(list("key"="page_down"))
        Sys.sleep(2)
    }
    if(exists("pagesource")){
        if(pagesource == driver$getPageSource()[[1]]){
            flag <- FALSE
            writeLines(paste0("Scrolled down ",n*counter," times.\n"))
        } else {
            pagesource <- driver$getPageSource()[[1]]
        }
    } else {
        pagesource <- driver$getPageSource()[[1]]
    }
}
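
As a variation on the loop above (not taken from the linked cheat sheet), the stopping test can compare the page height reported by the browser rather than the whole page source, which keeps the comparison cheap on a page with thousands of profiles. This is only a sketch: the function name scroll_to_end and its arguments pause, patience and max_rounds are hypothetical, remDr is assumed to be an RSelenium client such as rD$client, and it assumes the page grows document.body.scrollHeight whenever new profiles load.

```r
# Scroll to the bottom repeatedly until the page height stops growing.
# remDr      : assumed RSelenium client (anything with an executeScript() method)
# pause      : seconds to wait after each scroll so new content can load
# patience   : consecutive scrolls without growth before giving up
# max_rounds : hard cap on scrolls, as a guard against truly infinite pages
scroll_to_end <- function(remDr, pause = 2, patience = 3, max_rounds = 1000) {
  get_height <- function() {
    as.numeric(remDr$executeScript("return document.body.scrollHeight;")[[1]])
  }
  last_height <- get_height()
  unchanged <- 0
  rounds <- 0
  while (unchanged < patience && rounds < max_rounds) {
    rounds <- rounds + 1
    remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
    Sys.sleep(pause)
    new_height <- get_height()
    if (new_height > last_height) {
      # page grew: reset the counter and keep scrolling
      unchanged <- 0
      last_height <- new_height
    } else {
      unchanged <- unchanged + 1
    }
  }
  invisible(rounds)  # number of scrolls performed
}
```

Under those assumptions, the same call, scroll_to_end(remDr), should work for both the short and the long page after remDr$navigate(journal_url); the loaded page can then be taken from remDr$getPageSource()[[1]]. A slow connection may need a larger pause or patience.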