Tags: r, for-loop, dplyr, rvest, nested-for-loop

Nested For-Loop Failed To Store Data From Previous Iteration


I'm new to web scraping; I actually only learned about it last night.

Brief:

I'm trying to scrape Science Direct search result pages while logged in to my account.

I'm trying to store all the titles from each iteration (there are three pages, i.e., three iterations), and within each iteration I use another for loop to read the 25 unique CSS selectors, one per title on that page.

However, it only stored the titles from the last iteration (the 3rd page).

I know the code works when I scrape a single page, but the problem appears when I try to scrape the 'Next' pages using the outer for loop:

for (i in seq(from = 0, to = 50, by = 25)) {

As I said earlier, the code only stores the results of the last iteration (i.e., the 3rd page, which consists of 25 articles).

By the way, each page has an option to display 25, 50, or 100 articles per page; I chose 25, hence the step of 25 in the sequence.
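Just to spell out the offsets (this snippet is only illustrative), the sequence gives the three page offsets I want to visit:

offsets <- seq(from = 0, to = 50, by = 25)
offsets
#> [1]  0 25 50   # page 1, page 2, page 3 (25 articles each)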

Here is the code:

#install.packages("xml2") # required for rvest
library("rvest") # for web scraping
library("dplyr") # for data management

titleNo = c()
name = list()
for (i in seq(from = 0, to = 50, by = 25)) {
  # build the search URL for the current page offset (0, 25, 50)
  link = paste0("https://www.sciencedirect.com/search?qs=PISA%2C%20Programme%20for%20International%20Student%20Assessment&date=2010-2021&articleTypes=FLA&lastSelectedFacet=subjectAreas&subjectAreas=3300%2C3200%2C2800%2C2000%2C1200%2C1700%2C1400%2C1800%2C2200&offset=", i)
  for (j in 1:26) {
    page = read_html(link)
    # CSS selector for the j-th result's title on the page
    titleNo[j] = paste0(".push-m:nth-child(", j, ") h2")
    name[j] <- list(page %>% html_nodes(titleNo[j]) %>% html_text())
  }
  print(paste(i))
}

name <- data.frame(unlist(name))

Can you point out what I'm doing wrong?

The code successfully runs through all the pages; however, on each iteration it wipes out the name variable and stores the new page's titles, so by the end only the last iteration remains.

I think my problem lies in my for-loop. I'm not sure if I'm doing the right thing or not.
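To illustrate what I think is happening (toy values only, no scraping): the inner loop always writes into positions 1 through 26 of name, so every pass of the outer loop overwrites the previous page's entries.

# toy reproduction of the problem (no scraping involved)
name <- list()
for (i in c("page1", "page2", "page3")) {
  for (j in 1:3) {
    name[j] <- list(paste(i, "title", j))   # always indexes 1..3
  }
}
unlist(name)
#> [1] "page3 title 1" "page3 title 2" "page3 title 3"   # only the last page survives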

Thanks


Solution

  • I think you are overcomplicating this. You can extract all 25 titles on a page in one go with a single CSS selector.

    You can then unlist the results to get them as one combined vector.

    library(rvest)
    
    # offsets 0, 25, 50 give the three result pages (25 articles each)
    values <- seq(from = 0, to = 50, by = 25)
    link <- paste0("https://www.sciencedirect.com/search?qs=PISA%2C%20Programme%20for%20International%20Student%20Assessment&date=2010-2021&articleTypes=FLA&lastSelectedFacet=subjectAreas&subjectAreas=3300%2C3200%2C2800%2C2000%2C1200%2C1700%2C1400%2C1800%2C2200&offset=", values)
    
    # read each page once and pull every title with a single selector
    result <- lapply(link, function(x) x %>%
          read_html() %>%
          html_nodes('div.result-item-content h2 span a') %>%
          html_text())
    titles <- unlist(result)
    titles
    
     #[1] "Computer-generated log-file analyses as a window into students' minds? A showcase study based on the PISA 2012 assessment of problem solving"                                                 
     #[2] "The Comparison between Successful and Unsuccessful Countries in PISA, 2009"                                                                                                                   
     #[3] "Educational Data Mining: Identification of factors associated with school effectiveness in PISA assessment"                                                                                   
     #[4] "Curriculum standardization, stratification, and students’ STEM-related occupational expectations: Evidence from PISA 2006"                                                                    
     #[5] "Testing measurement invariance of PISA 2015 mathematics, science, and ICT scales using the alignment method"                                                                                  
     #[6] "Effects of students’ and schools’ characteristics on mathematics achievement: findings from PISA 2006"
    #...
    #...
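
    If you would rather keep your original nested-loop structure, here is a minimal sketch of one possible fix (untested against the live site): read each page only once and append that page's titles as a new list element instead of reusing indices 1 to 26 on every pass.

    # sketch: one read_html() per page, append the page's titles instead of overwriting
    library(rvest)
    
    name <- list()
    for (i in seq(from = 0, to = 50, by = 25)) {
      link <- paste0("https://www.sciencedirect.com/search?qs=PISA%2C%20Programme%20for%20International%20Student%20Assessment&date=2010-2021&articleTypes=FLA&lastSelectedFacet=subjectAreas&subjectAreas=3300%2C3200%2C2800%2C2000%2C1200%2C1700%2C1400%2C1800%2C2200&offset=", i)
      page_titles <- read_html(link) %>%
        html_nodes('div.result-item-content h2 span a') %>%
        html_text()
      name[[length(name) + 1]] <- page_titles   # one element per page, nothing overwritten
    }
    titles <- unlist(name)

    As before, unlist() flattens the per-page results into a single character vector of titles.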