Tags: r, web-scraping, rvest

R - automating web page text scrape


I'm trying to automate scraping text from a website using rvest, but I get the error below when I try a loop that reads web page URLs from the vector book.titles.urls. However, when I scrape the desired text from a single page (without the loop), it works just fine:

Working Code

library(rvest)
library(tidyverse)

#Paste URL to be read by read_html function
lex.url <- 'https://fab.lexile.com/search/results?keyword=The+True+Story+of+the+Three+Little+Pigs'
lex.webpage <- read_html(lex.url)

#Use CSS selectors to scrape lexile numbers and convert data to text
lex.num <- html_nodes(lex.webpage, '.results-lexile-code')
lex.num.txt <- html_text(lex.num[1])

> lex.num.txt
[1] "AD510L"

Reprex

library(rvest)
library(tidyverse)

book.titles <- c("The+True+Story+of+the+Three+Little+Pigs",
                 "The+Teacher+from+the+Black+Lagoon",
                 "A+Letter+to+Amy",
                 "The+Principal+from+the+Black+Lagoon",
                 "The+Art+Teacher+from+the+Black+Lagoon")
book.titles.urls <- paste0("https://fab.lexile.com/search/results?keyword=", book.titles)

out <- length(book.titles)
for (i in seq_along(book.titles.urls)) {
  node1 <- html_session(i)
  lex.url <- as.character(book.titles.urls[i])
  lex.webpage <- read_html(lex.url[i])
  lex.num <- html_nodes(node1, lex.webpage[i], '.results-lexile-code')
  lex.num.txt <- html_text(lex.num[i][1])
  out <- lex.num.txt[i]
}

Error code

Error in httr::handle(url) : is.character(url) is not TRUE


Solution

  • The error occurs because you are passing an integer to the html_session function, which expects a character string (i.e. a URL). I do not believe it is necessary to create a session here; that function is generally used when you need to log into the web site with a user id and password.

    You can simplify your loop:

    #output list
    output <- list()
    j <- 1   # index into the output list
    for (i in book.titles.urls) {
      lex.num <- html_nodes(read_html(i), '.results-lexile-code')
      # process the returned list of nodes, lex.num, here
      output[[j]] <- html_text(lex.num)
      j <- j + 1
    }
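A variation on the same loop (an untested sketch; the inline HTML snippets below are stand-ins so the loop shape is visible without network access — real code would iterate over book.titles.urls): seq_along supplies the index directly, removing the manual j counter, and vector("list", n) pre-sizes the result.

```r
library(rvest)

# stand-in "pages" built from inline HTML strings; in the real loop
# these would be the URLs in book.titles.urls
pages <- c('<span class="results-lexile-code">AD510L</span>',
           '<span class="results-lexile-code">550L</span>')

output <- vector("list", length(pages))   # pre-sized result list
for (j in seq_along(pages)) {             # seq_along gives the index directly
  page <- read_html(pages[j])
  output[[j]] <- html_text(html_nodes(page, '.results-lexile-code'))
}
```

Naming the list afterwards (names(output) <- book.titles) makes each result easy to look up by title.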
    

    I have not tested this, but I will offer this warning: when scraping a web site, please ensure you agree to and abide by its terms of service.

    Edit: Here is a further simplification using lapply, which returns a list of character vectors, one for each URL:

    library(rvest)  # read_html(), html_nodes(), html_text()
    library(dplyr)  # supplies the %>% pipe
    listofresults <- lapply(book.titles.urls, function(i) {
      read_html(i) %>%
        html_nodes('.results-lexile-code') %>%
        html_text()
    })
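With several URLs, a single failed request will abort the whole lapply call. A hedged sketch (untested against the live site; same selector as above) that wraps each page in tryCatch so a bad URL yields NA instead of an error:

```r
library(rvest)

# returns the lexile codes found on a page,
# or NA if the page cannot be fetched or parsed
safe_scrape <- function(url) {
  tryCatch(
    read_html(url) %>%
      html_nodes('.results-lexile-code') %>%
      html_text(),
    error = function(e) NA_character_
  )
}
```

Then `lapply(book.titles.urls, safe_scrape)` drops in for the anonymous function above.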