Tags: r, dictionary, for-loop, tidyverse, screen-scraping

Scraping multiple articles by using purrr::map, not for loop in R


Hi, dear community members.

I'm trying to scrape the article titles from this website (https://yomou.syosetu.com/search.php?&type=er&order_former=search&order=new&notnizi=1&p=1) using R.

I executed the following code.

### load rvest (it also re-exports the %>% pipe used below) ###
library(rvest)

### read HTML ###
html_narou <- rvest::read_html("https://yomou.syosetu.com/search.php?&type=er&order_former=search&order=new&notnizi=1&p=1",
                               encoding = "UTF-8")

### create the common part object of CSS ###
base_css_former <- "#main_search > div:nth-child("
base_css_latter <- ") > div > a"

### create NULL objects ###
art_css <- NULL
narou_titles <- NULL

### extract the title data and store them into the NULL object ###
#### The article titles are not in "#main_search > div:nth-child(1~4) > div > a", so i starts at 5 in the loop ####
for (i in 5:24) {
  art_css <- paste0(base_css_former, as.character(i), base_css_latter) 
  
  narou_title <- rvest::html_element(x = html_narou,
                                     css = art_css) %>% 
    rvest::html_text()

  narou_titles <- base::append(narou_titles, narou_title)
}

However, the for loop takes a long time in R, so I would like to use the map function from purrr instead. I'm not familiar with purrr::map, though, and the process seems complicated. How can I replace the for loop with map?


Solution

  • The real issue is that you grow the narou_titles vector on every iteration. Each append() call copies the whole vector, which is notoriously slow in R. Instead, pre-allocate the vector at its final length and assign elements by index. purrr does this behind the scenes, which can make it appear faster, but you can do the same thing without purrr.

    With your for loop:

    library(rvest)
    
    narou_titles <- vector("character", 20)
    for (i in 5:24) {
      art_css <- paste0(base_css_former, as.character(i), base_css_latter)

      # i runs from 5 to 24, so shift it to fill positions 1 to 20
      narou_titles[[i - 4]] <- html_element(
          x = html_narou,
          css = art_css
        ) %>%
        html_text()
    }
    

    With purrr::map_chr():

    library(rvest)
    library(purrr)
    
    get_title <- function(i) {
      art_css <- paste0(base_css_former, as.character(i), base_css_latter)  
      html_element(
        x = html_narou,
        css = art_css
      ) %>% 
      html_text()
    }
    narou_titles <- map_chr(5:24, get_title)
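
For comparison, the same pattern can also be written in base R with vapply(). The sketch below is only an illustration: get_title_stub() is a hypothetical stand-in for get_title(), so the code runs without network access or rvest. It shows that filling a pre-allocated vector by position and calling vapply() give the same result.

```r
# Hypothetical stand-in for the scraping step: returns a fake title for
# child index i, so this sketch runs without network access or rvest.
get_title_stub <- function(i) {
  paste0("title-", i)
}

child_index <- 5:24

# Option 1: pre-allocate, then fill by position k (not by child index).
titles_loop <- character(length(child_index))
for (k in seq_along(child_index)) {
  titles_loop[[k]] <- get_title_stub(child_index[[k]])
}

# Option 2: vapply() allocates the result itself and checks that every call
# returns exactly one character value, much like purrr::map_chr().
titles_vapply <- vapply(child_index, get_title_stub, character(1))

# Both approaches should produce the same 20-element character vector.
same_result <- identical(titles_loop, titles_vapply)
```

With the real scraper, passing get_title in place of get_title_stub should work the same way, assuming each call returns a single string.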