r, web-scraping, rvest, rselenium, dynamic-loading

R: Scraping a dynamically loading page with long but finite scrolling until the end (with RSelenium?)


This page almost seems infinite, as it shows over 6,000 profiles that load dynamically as you scroll.

A similar one only shows 310 profiles, so scrolling to its end does not require so much time.

Is there a way to write a single script that could scrape both pages by scrolling to the end?

For a similar purpose, I used RSelenium code like this:

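journal_url <- "https://www.frontiersin.org/journals/photonics#editorial-board"

rD <- RSelenium::rsDriver(browser = "chrome", port = 4546L, verbose = FALSE, chromever = "87.0.4280.20")
remDr <- rD$client            # the client object that drives the browser
remDr$navigate(journal_url)   # load the page before scrolling

# scroll down five times, pausing so that new content can load
for (i in 1:5) {
    remDr$executeScript(paste("scroll(0,", i * 10000, ");"))
    Sys.sleep(3)
}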

But in the present case, scrolling five times, as in for(i in 1:5), may suffice for the second page (with 350 profiles) but will not be sufficient for the first one (with 6,000 profiles). If someone could point me to a single script that handles pages of varying sizes, I would be very grateful!


Solution

  • I believe you'll find your answer here,

    https://github.com/yusuzech/r-web-scraping-cheat-sheet#223-simulating-scrolls-clicks-text-inputs-logins-and-other-actions

    under the subheader "Scroll Down Until the End(Not Recommended if There Are too Many Pages)."

    Edit: Here's the suggested code from the link.

    element <- driver$findElement("css", "body")
    flag <- TRUE
    counter <- 0
    n <- 5
    while(flag){
        counter <- counter + 1
        # compare the page source only after every n (n = 5) scrolls, since a single scroll does not always render new content
        for(i in 1:n){
            element$sendKeysToElement(list("key"="page_down"))
            Sys.sleep(2)
        }
        if(exists("pagesource")){
            if(pagesource == driver$getPageSource()[[1]]){
                flag <- FALSE
                writeLines(paste0("Scrolled down ",n*counter," times.\n"))
            } else {
                pagesource <- driver$getPageSource()[[1]]
            }
        } else {
            pagesource <- driver$getPageSource()[[1]]
        }
    }
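
A possible variation on the same idea, offered only as a sketch: instead of comparing the full page source, compare the page height after each scroll and stop once it has stopped growing for a few rounds in a row. The names scroll_to_end, pause, patience and max_rounds are made up for this illustration and do not come from the linked cheat sheet. It assumes remDr is an RSelenium client (for example rD$client) and that the page grows document.body.scrollHeight as it loads more profiles; if the site scrolls inside an inner container, the JavaScript would need adjusting.

```r
# Sketch: scroll until the page height stops growing.
# `remDr` is assumed to be an RSelenium remoteDriver client, e.g. rD$client.
scroll_to_end <- function(remDr, pause = 2, patience = 3, max_rounds = 1000) {
  # current total height of the document, as reported by the browser
  get_height <- function() {
    remDr$executeScript("return document.body.scrollHeight;", args = list())[[1]]
  }

  last_height <- get_height()
  unchanged <- 0   # consecutive rounds without any growth
  rounds <- 0      # total number of scrolls performed

  while (unchanged < patience && rounds < max_rounds) {
    rounds <- rounds + 1
    # jump to the current bottom of the page, then wait for content to load
    remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);", args = list())
    Sys.sleep(pause)

    new_height <- get_height()
    if (new_height > last_height) {
      unchanged <- 0
      last_height <- new_height
    } else {
      unchanged <- unchanged + 1
    }
  }

  invisible(rounds)
}
```

After remDr$navigate(journal_url), calling scroll_to_end(remDr) should work for both the short and the long page, since the loop length depends on the page rather than on a fixed count. The patience argument plays the same role as n in the code above (one scroll does not always trigger new content), and max_rounds is just a safety cap in case a page really is infinite.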