Scraping data using R and placing results in a data frame


I'm trying to scrape reviews from Glassdoor using rvest and place the results in a data frame with one row per review. My code is below, but the section where I try to pull the sub-ratings (work-life balance, culture and values, etc.) doesn't work. There are five different sub-ratings within a drop-down, and one or more of them may be blank for any given review. Here's my preliminary code. Do you have any suggestions for how I can pull the sub-ratings and put each one in a separate column of my data frame?

## Load libraries
library(httr)  
library(xml2)  
library(rvest) 
library(purrr) 
library(tidyverse)
library(lubridate)

## URL for scraping
url = "https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm"
pg_reviews = read_html(url)

##Create data frame and define features to scrape
Google_reviews = data.frame()

class.ratings = c()
styles = pg_reviews %>% html_elements('style')
for(s in styles) {
     class = s %>% html_attr('data-emotion-css')
     class = paste0('css-', class)
     rating = str_match(s %>% html_text2(), '(\\d+)%')[2]
     class.ratings[class] = as.numeric(rating)/20
}

reviews = pg_reviews %>% html_elements('.gdReview')

summary = pg_reviews %>% 
     html_elements(".reviewLink") %>% 
     html_text()

rating = pg_reviews %>%
     html_elements("#ReviewsFeed .mr-xsm") %>%
     html_text()

pros = pg_reviews %>%
     html_elements(".v2__EIReviewDetailsV2__fullWidth:nth-child(1) span") %>%
     html_text()

cons = pg_reviews %>%
     html_elements(".v2__EIReviewDetailsV2__fullWidth:nth-child(2) span") %>%
     html_text()

#Subratings--DOES NOT WORK
for(re in reviews) {
     subratings = re %>% html_elements('.content') %>% html_elements('li')
     for(i = 1 to 5) {
          
          label = i %>% html_element('div') %>% html_text()
          classes = i %>% html_elements('div[font-size="sm"]') %>% html_attr('class')
          class = str_split(classes, ' ')[[1]][1] # take the first class attribute
          cat(class.ratings[class], ',')
          
     }
work_life_balance <- subratings(1)
culture_values <- subratings(2)
career_opportunities <- subratings(3)
comp_benefits <- subratings(4)
management <- subratings(5)



}


Google_reviews = rbind(Google_reviews,data.frame(summary,rating,pros,cons,work_life_balance,culture_values
                                                 career_opportunities,comp_benefits,management))

Solution

  • Obtaining the sub-ratings and parsing them into a data frame was not trivial.
    See the comments in the code for details.

    Updated

    library(rvest)
    
    url = "https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm"
    pg_reviews = read_html(url)
    
    library(stringr)
    # The ratings are stored in a JSON-like data structure inside a <script> tag.
    # Collect every script on the page, then search them.
    scripts <- pg_reviews %>% html_elements(xpath = '//script')
    
    # Find the script(s) that mention the ratings
    ratingsScript <- which(grepl("ratingCareerOpportunities", scripts))
    # Trim the script down to just the data. It looks like JSON, but I haven't
    # worked out exactly where the object begins and ends, so keep everything
    # from "urlParams" through a closing "}}}}".
    data <- scripts[ratingsScript] %>% html_text2() %>% str_extract("\"urlParams\":.+\\}\\}\\}\\}")
    
    
    # Extract the ratings: each pattern returns the single digit that follows its key
    WorkLifeBalance         <- str_extract_all(data, '(?<="ratingWorkLifeBalance":)\\d') %>% unlist() %>% as.integer()
    CultureAndValues        <- str_extract_all(data, '(?<="ratingCultureAndValues":)\\d') %>% unlist() %>% as.integer()
    DiversityAndInclusion   <- str_extract_all(data, '(?<="ratingDiversityAndInclusion":)\\d') %>% unlist() %>% as.integer()
    SeniorLeadership        <- str_extract_all(data, '(?<="ratingSeniorLeadership":)\\d') %>% unlist() %>% as.integer()
    CareerOpportunities     <- str_extract_all(data, '(?<="ratingCareerOpportunities":)\\d') %>% unlist() %>% as.integer()
    CompensationAndBenefits <- str_extract_all(data, '(?<="ratingCompensationAndBenefits":)\\d') %>% unlist() %>% as.integer()
    
    ratings <- cbind(WorkLifeBalance, CultureAndValues, DiversityAndInclusion, SeniorLeadership, CareerOpportunities, CompensationAndBenefits)
    
          WorkLifeBalance CultureAndValues DiversityAndInclusion SeniorLeadership CareerOpportunities CompensationAndBenefits
     [1,]               2                4                     2                4                   5                       4
     [2,]               2                3                     0                3                   3                       5
     [3,]               5                4                     0                4                   5                       5
     [4,]               5                5                     5                5                   5                       5
     [5,]               0                0                     0                0                   1                       0
     [6,]               5                5                     5                5                   5                       5
     [7,]               0                0                     0                0                   0                       0
     [8,]               0                0                     0                0                   0                       0
     [9,]               0                0                     0                0                   0                       0
    [10,]               0                0                     0                0                   0                       0
    

    All of the information associated with the reviews should be stored in the "data" variable. It appears to be JSON, but I couldn't determine the correct start and end points of the object, hence the manual extraction of the ratings with regular expressions.
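
    To make the manual extraction easier to follow, here is a minimal sketch of the same look-behind idea on a tiny made-up string. It uses base R (`gregexpr()` with `perl = TRUE` plus `regmatches()`) rather than stringr so that it runs without extra packages. The `sample_data` string and the `extract_rating()` helper are assumptions for illustration only, not part of the page or of any library.

    ```r
    # Hypothetical stand-in for the JSON-like text held in `data`
    sample_data <- paste0(
      '{"ratingWorkLifeBalance":2,"ratingCareerOpportunities":5},',
      '{"ratingWorkLifeBalance":0,"ratingCareerOpportunities":3}'
    )

    # Hypothetical helper: return every single digit that directly follows "<key>":
    extract_rating <- function(text, key) {
      # Look-behind keeps the key out of the match, so only the digit is returned
      pattern <- paste0('(?<="', key, '":)\\d')
      hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))
      as.integer(unlist(hits))
    }

    work_life <- extract_rating(sample_data, "ratingWorkLifeBalance")
    career    <- extract_rating(sample_data, "ratingCareerOpportunities")

    # One row per review, one column per sub-rating
    print(cbind(WorkLifeBalance = work_life, CareerOpportunities = career))
    ```

    Because each key appears once per review, the extracted vectors line up by position, which is what lets them be bound together column by column.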
    The cbind() call produces a matrix with one row per review and one column for each sub-rating category; wrap it in as.data.frame() if you need a data frame. You may want to convert the 0s to NA. You can then cbind() the result to your "Google_reviews" data frame.
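
    A rough sketch of that final step, assuming a 0 means the reviewer left the category unrated. The small `ratings` matrix and the one-column `Google_reviews` frame below are made-up stand-ins; in practice both would come from the scraping code and must have the same number of rows.

    ```r
    # Hypothetical stand-in for the `ratings` matrix produced by cbind()
    ratings <- cbind(
      WorkLifeBalance     = c(2L, 0L, 5L),
      CareerOpportunities = c(5L, 1L, 0L)
    )

    # Convert the matrix to a data frame, then treat 0 as "not rated" (assumption)
    ratings_df <- as.data.frame(ratings)
    ratings_df[ratings_df == 0] <- NA

    # Hypothetical review-level frame with one row per review
    Google_reviews <- data.frame(
      summary = c("Great place", "It was okay", "Good pay"),
      stringsAsFactors = FALSE
    )

    # Attach one column per sub-rating
    Google_reviews <- cbind(Google_reviews, ratings_df)
    print(Google_reviews)
    ```

    If the row counts differ, cbind() will either fail or recycle rows, so it is worth checking nrow() on both pieces first.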