rvest web scraping, character(0)


I have done web scraping a few times with rvest, but I have never had a request come back as character(0). Is this a sign that the site is preventing me from scraping its data, or is the content loaded by some kind of JavaScript/JSON query?

library(rvest)
library(robotstxt)

## check whether the site allows scraping, then read the page
paths_allowed("https://www.ratemyprofessors.com/search/teachers?query=*&sid=668htm")
njit <- read_html("https://www.ratemyprofessors.com/search/teachers?query=*&sid=668htm")

## check the object class
class(njit)

## extract professor names
prof <- njit %>%
  html_nodes(".cJdVEK") %>%
  html_text2()

Solution

  • I'm assuming you want to pull all the professors associated with the New Jersey Institute of Technology? I scraped the page at this link: https://www.ratemyprofessors.com/search/teachers?query=*&sid=668 (It's the original link minus the "htm" at the end).
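
    To answer the first part of your question: the empty result isn't necessarily a sign you're being blocked. You can verify that the static page simply doesn't contain the professor cards (a minimal check, reusing the .cJdVEK selector from your code):

    library(rvest)
    
    # rvest only downloads the static html; no JavaScript runs, so the
    # professor cards never make it into what rvest sees
    njit <- read_html("https://www.ratemyprofessors.com/search/teachers?query=*&sid=668")
    njit %>%
      html_nodes(".cJdVEK") %>%
      html_text2()
    #> character(0)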

    Because the page uses JavaScript to render its content, the HTML rvest sees is different from the HTML the user sees. Additionally, the results are loaded dynamically as the user scrolls down. Here's a way to use RSelenium to automate the web browser, scrolling until it has found all 1,000 or so professors at this university:

    # load libraries
    library(RSelenium)
    library(rvest)
    library(magrittr)
    library(readr)
    
    # define target url
    url <- "https://www.ratemyprofessors.com/search/teachers?query=*&sid=668"
    
    
    # start RSelenium ------------------------------------------------------------
    
    rD <- rsDriver(browser="firefox", port=4550L, chromever = NULL)
    remDr <- rD[["client"]]
    
    # open the remote driver-------------------------------------------------------
    # If it doesn't open automatically:
    remDr$open()
    
    # Navigate to webpage -----------------------------------------------------
    remDr$navigate(url)
    
    
    # Close "this site uses cookies" button
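    # (note: hashed class names like "Buttons__Button-sc-19xdot-1" are
    # auto-generated by the site's styling framework and may change, so
    # this selector can break over time)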
    remDr$findElement(using = "css",value = "button.Buttons__Button-sc-19xdot-1:nth-child(3)")$clickElement()
    
    
    # Find the number of profs
    # pull the webpage html
    # then read it
    page_html <- remDr$getPageSource()[[1]] %>% 
      read_html()
    
    # extract the number of results
    number_of_profs <- page_html %>% 
                      html_node("h1") %>% 
                      html_text() %>% 
                      parse_number()
    
    
    # Define a variable for the number of results we've pulled
    number_of_profs_pulled <- 0
    
    
    # While the number of scraped results is less than the total number of
    # results, we keep scrolling and pulling the html
    
    while(number_of_profs > number_of_profs_pulled){
    
      # scroll down the page.
      # Root is the html id of the container that holds the search results;
      # we want to scroll just to the bottom of the search results, not the
      # bottom of the page, because it looks like the "click for more results"
      # button doesn't appear in the html unless you're literally right at
      # that part of the page
      webElem <- remDr$findElement("css", ".SearchResultsPage__StyledSearchResultsPage-vhbycj-0")
      #webElem$sendKeysToElement(list(key = "end"))
      webElem$sendKeysToElement(list(key = "down_arrow"))
    
      # click on the show more button ------------------------------------
      remDr$findElement(using = "css", value = ".Buttons__Button-sc-19xdot-1")$clickElement()
    
      # pull the webpage html
      # then read it
      page_html <- remDr$getPageSource()[[1]] %>% 
        read_html()
    
      ## extract professor names
      prof_names <- page_html %>%
        html_nodes(".cJdVEK") %>%
        html_text()
    
      # update the number of profs we pulled
      # so we know if we need to keep running the loop
      number_of_profs_pulled <- length(prof_names)
    }
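    
    # when you're done, close the browser and stop the Selenium server
    # (remDr$close() and rD$server$stop() are the standard RSelenium
    # teardown calls)
    remDr$close()
    rD$server$stop()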
    

    Results

    > str(prof_names)
     chr [1:1250] "David Whitebook" "Donald Getzin" "Joseph Frank" "Soroosh Mohebbi" "Robert Lynch" "Don Wall" "Denis Blackmore" "Soha Abdeljaber" "Lamine Dieng" "Yehoshua Perl" "Douglas Burris" ...
    > 
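    
    If you want to keep the list, readr is already loaded, so a one-line save works (the file name here is just an example):
    
    write_csv(data.frame(professor = prof_names), "njit_professors.csv")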
    
    

    Caveats:

    1. This will be slow because you have to keep waiting for the page to reload.
    2. You may want to slow it down further to avoid the site blocking you as a bot. You can also add random mouse movements and keystrokes with RSelenium to reduce the risk of getting blocked; a simple throttling sketch follows this list.
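
    For example, a simple way to throttle the loop is a random pause on each iteration (the 2-5 second range here is arbitrary; tune it to taste):

    # inside the while loop: sleep a random 2-5 seconds so requests
    # don't fire off as fast as an obvious bot
    Sys.sleep(runif(1, min = 2, max = 5))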