Tags: r, for-loop, web-scraping, rcurl, rvest

R For loop unwanted overwrite


I would like to store the result of each loop iteration in its own object (text with some name attached).

Right now each iteration overwrites the previous result:

library(rvest)

main.page <- read_html(x = "http://www.imdb.com/event/ev0000681/2016")
urls <- main.page %>% # feed `main.page` to the next step
    html_nodes(".alt:nth-child(2) strong a") %>% # get the CSS nodes
    html_attr("href") # extract the URLs


for (i in urls){
    a01 <- paste0("http://www.imdb.com",i)
    text <- read_html(a01) %>% # load the page
            html_nodes(".credit_summary_item~ .credit_summary_item+ .credit_summary_item .itemprop , .summary_text+ .credit_summary_item .itemprop") %>% # isolate the text
            html_text()
}        

How could I write the loop so that the 'i' from the list of URLs is attached to text in the for statement, so that every result is kept separately?


Solution

  • To solidify my comment:

    library(rvest)

    main.page <- read_html(x = "http://www.imdb.com/event/ev0000681/2016")
    urls <- main.page %>% # feed `main.page` to the next step
        html_nodes(".alt:nth-child(2) strong a") %>% # get the CSS nodes
        html_attr("href") # extract the URLs
    
    texts <- sapply(head(urls, n = 3), function(i) {
      read_html(paste0("http://www.imdb.com", i)) %>%
        html_nodes(".credit_summary_item~ .credit_summary_item+ .credit_summary_item .itemprop , .summary_text+ .credit_summary_item .itemprop") %>%
        html_text()
      }, simplify = FALSE)
    str(texts)
    # List of 3
    #  $ /title/tt5843990/: chr [1:4] "Lav Diaz" "Charo Santos-Concio" "John Lloyd Cruz" "Michael De Mesa"
    #  $ /title/tt4551318/: chr [1:4] "Andrey Konchalovskiy" "Yuliya Vysotskaya" "Peter Kurth" "Philippe Duquesne"
    #  $ /title/tt4550098/: chr [1:4] "Tom Ford" "Amy Adams" "Jake Gyllenhaal" "Michael Shannon"
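
    If a for loop is preferred, the same collection can be built by assigning into a named list rather than into a single variable. The following is only a minimal sketch: fetch_credits() and the three paths are hypothetical stand-ins for the rvest pipeline and for urls, so that it runs without network access.

```r
# Hypothetical stand-in for the read_html() %>% html_nodes() %>% html_text()
# pipeline above; it only builds dummy strings.
fetch_credits <- function(path) {
  paste("credit", 1:2, "for", path)
}

# Hypothetical partial URLs standing in for `urls`
urls <- c("/title/a/", "/title/b/", "/title/c/")

# Pre-allocate one list slot per URL and name the slots after the URLs
texts <- vector("list", length(urls))
names(texts) <- urls

for (i in urls) {
  # Assign into the slot named `i` instead of overwriting one variable
  texts[[i]] <- fetch_credits(i)
}

str(texts)
```

    Each result can then be retrieved by its partial URL, for example texts[["/title/b/"]]. This assumes the URLs are unique; duplicated names would write into the same slot.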
    

    If you use lapply(...), you'll get an unnamed list, which may or may not be a problem for you. With sapply(..., simplify = FALSE), you instead get a named list where each name is (in this case) the partial URL taken from urls. sapply adds these names because urls is a character vector.
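
    The difference in naming can be seen without any scraping. In this small sketch, paths is a hypothetical character vector standing in for urls, and nchar() stands in for the scraping function:

```r
# Hypothetical stand-ins for the partial URLs in `urls`
paths <- c("/title/a/", "/title/b/")

unnamed <- lapply(paths, nchar)                    # no names attached
named   <- sapply(paths, nchar, simplify = FALSE)  # names taken from `paths`

names(unnamed)  # NULL
names(named)    # "/title/a/" "/title/b/"

# lapply() output can be given the same names afterwards with setNames()
renamed <- setNames(lapply(paths, nchar), paths)
identical(named, renamed)  # TRUE
```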

    Using sapply without simplify = FALSE can lead to unexpected output. As an example:

    set.seed(9)
    sapply(1:3, function(i) rep(i, sample(3, size=1)))
    # [1] 1 2 3
    

    One might think that this will always return a vector. However, if the elements returned are not all the same length (for instance), the result is a list instead of a vector:

    set.seed(10)
    sapply(1:3, function(i) rep(i, sample(3, size=1)))
    # [[1]]
    # [1] 1 1
    # [[2]]
    # [1] 2
    # [[3]]
    # [1] 3 3
    

    For that reason, it's best to have certainty in the return value by forcing a list:

    set.seed(9)
    sapply(1:3, function(i) rep(i, sample(3, size=1)), simplify = FALSE)
    # [[1]]
    # [1] 1
    # [[2]]
    # [1] 2
    # [[3]]
    # [1] 3
    

    That way, you always know exactly how to reference sub-returns. (This is one of the tenets and advantages of Hadley's purrr package: each function always returns an object of exactly the type you declare. There are other advantages to the package as well.)
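
    Base R offers a comparable guarantee through vapply(), where the expected type and length of each result is declared up front. This is a base-R analogue shown for illustration, not purrr itself; purrr's typed variants such as map_int() follow the same idea.

```r
# Each call must return exactly one integer, as declared by FUN.VALUE
lens <- vapply(list(1:2, 3L, 4:6), length, FUN.VALUE = integer(1))
lens

# A result of a different length raises an error instead of silently
# changing the shape of the output, as sapply() would.
res <- tryCatch(
  vapply(1:3, function(i) rep(i, i), FUN.VALUE = integer(1)),
  error = function(e) "error"
)
res
```

    Here lens is always an integer vector with one element per input, and the second call fails as soon as one result does not match the declared shape.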