Tags: r, web-scraping, rvest

I cannot scrape one element of Glassdoor.com using R


I am trying to scrape some data from Glassdoor.com for a project. This is the code I have so far to scrape it:

## Load libraries
library(httr)  
library(xml2)  
library(rvest) 
library(purrr) 
library(tidyverse)
library(lubridate)

  # URLS for scraping
  start_url <- "https://www.glassdoor.co.uk/Reviews/Company-Reviews-"
  settings_url <- ".htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
  
  
  ### Scrape Reviews
  map_df(1:1, function(i){
    Sys.sleep(3)
    tryCatch({
      pg_reviews <- read_html(GET(paste(start_url, "E8450", "_P", i, settings_url, sep = "")))
      table = pg_reviews %>% 
        html_elements(".mb-0")
      
      data.frame(date = pg_reviews %>% 
                   html_elements(".middle.common__EiReviewDetailsStyle__newGrey") %>% 
                   html_text2(),
                 
                 summary = pg_reviews %>% 
                   html_elements(".reviewLink") %>% 
                   html_text(),
                 
                 rating = pg_reviews %>%
                   html_elements("#ReviewsFeed .mr-xsm") %>%
                   html_text(),
                 
                 employee_type = pg_reviews %>%
                   html_elements(".eg4psks0") %>%
                   html_text(),
                 
                 pros = pg_reviews %>%
                   html_elements(".v2__EIReviewDetailsV2__fullWidth:nth-child(1) span") %>%
                   html_text(),
                 
                 cons = pg_reviews %>%
                   html_elements(".v2__EIReviewDetailsV2__fullWidth:nth-child(2) span") %>%
                   html_text()
                 
      )}, error = function(e){
        NULL
      })
    
  }) -> reviews_df

Up to here everything works fine. However, I would also like to scrape the individual star ratings per category that are shown on some of the reviews.

But I am really struggling to find the specific element that holds those ratings. I would love to share my own attempt, but I am completely lost on this one. I have tried SelectorGadget and also inspecting the page, but I cannot get it to work.

Any suggestions?


Solution

    Locating the data

    Inspecting the stars in those ratings shows that they sit in the following HTML structure:

    ...
    <div class="content">
      <ul class="pl-0">
        <li>
          <div>Work/Life Balance</div>
          <div font-size="sm" class="css-18v8tui e1hd5jg10">
            <span class="gd-ui-star  css-vk03c5 e7cj4650" color="#0caa41" font-size="sm" tabindex="0" role="presentation">★</span>
            <span class="gd-ui-star  css-vk03c5 e7cj4650" color="#0caa41" font-size="sm" tabindex="0" role="presentation">★</span>
            <span class="gd-ui-star  css-vk03c5 e7cj4650" color="#0caa41" font-size="sm" tabindex="0" role="presentation">★</span>
            <span class="gd-ui-star  css-vk03c5 e7cj4650" color="#0caa41" font-size="sm" tabindex="0" role="presentation">★</span>
            <span class="gd-ui-star  css-vk03c5 e7cj4650" color="#0caa41" font-size="sm" tabindex="0" role="presentation">★</span>
          </div>
        </li>
        <li>
          <div>Culture &amp; Values</div>
          <div font-size="sm" class="css-18v8tui e1hd5jg10">
            ...
            
    

    All five star characters are always present in the HTML. How many of them appear colored is decided by CSS, through the class of the div immediately below the label (here 'Work/Life Balance'), e.g.:

    <div font-size="sm" class="css-18v8tui e1hd5jg10">
    

    We find the corresponding CSS elsewhere in the document:

    <style data-emotion-css="18v8tui">
      .css-18v8tui {
        display: inline-block;
        line-height: 1;
        background: linear-gradient(90deg, #0caa41 40%, #dee0e3 40%);
        -webkit-letter-spacing: 3px;
        -moz-letter-spacing: 3px;
        -ms-letter-spacing: 3px;
        letter-spacing: 3px;
        -webkit-background-clip: text;
        -webkit-text-fill-color: transparent;
      }
    </style>
    

    The 40% in the background value colors the first 40% of the div green (#0caa41) and the rest grey (#dee0e3), so 2 of the 5 stars light up in this example.
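
To make the percentage-to-stars arithmetic concrete, here is a minimal base R sketch. The helper name `stars_from_css` is made up for illustration and is not part of the original answer; it assumes a five-star scale and that the first percentage in the CSS text is the gradient stop.

```r
# Hypothetical helper (not from the original answer): turn the CSS text of a
# star-rating class into a number of stars, assuming a five-star scale.
stars_from_css <- function(css_text) {
  # Find the first "<digits>%" in the CSS text, e.g. "40%"
  m <- regmatches(css_text, regexpr("[0-9]+%", css_text))
  if (length(m) == 0) return(NA_real_)  # no percentage, so not a rating class
  pct <- as.numeric(sub("%", "", m))
  pct / 20                              # 100% / 5 stars = 20% per star
}

css <- ".css-18v8tui { background: linear-gradient(90deg, #0caa41 40%, #dee0e3 40%); }"
stars_from_css(css)  # 2 stars
```

A class whose CSS contains no percentage at all yields `NA`, which is how non-rating classes can be told apart.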

    Extracting the data

    First we load the page

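    library(rvest)    # read_html(), html_elements(), html_attr(), %>%
    library(stringr)  # str_match(), str_split()
    url = "https://www.glassdoor.co.uk/Reviews/PwC-Reviews-E8450.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng"
    pg_reviews = read_html(url)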
    

    Then we extract all <style> elements, each one containing a single class in this case. We take the first ...% value in each CSS class and divide it by 20 to convert from a percentage to a number of stars (five stars, so 20% per star). We save this number of stars in a named vector, where the name of each entry is the name of the corresponding CSS class. This allows us to map a rating's CSS class to a number of stars.

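    class.ratings = c()  # named vector: CSS class name -> number of stars
    styles = pg_reviews %>% html_elements('style')
    for(s in styles) {
      # data-emotion-css holds the class name without its 'css-' prefix
      class = s %>% html_attr('data-emotion-css')
      class = paste0('css-', class)
      # first percentage in the CSS text (NA if there is none)
      rating = str_match(s %>% html_text2(), '(\\d+)%')[2]
      class.ratings[class] = as.numeric(rating)/20  # 20% per star
    }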
    
    > class.ratings
    css-animation-1fypb1g           css-197m635            css-67i7qe 
                       NA                   5.0                   5.0 
               css-3x0lbp            css-hdvrkk            css-8hewl0 
                       NA                   5.0                   5.0 
              css-1x8evti           css-1ohf0ui           css-1htgz7a 
                       NA                    NA                   5.0 
     ...
    

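    Not every percentage that we found really corresponds to a star-rating, and classes without any percentage get NA, but that's okay: we only look up the classes that actually appear on rating elements.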
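
If the unrelated entries get in the way, one option is to keep only the styles whose CSS contains a `linear-gradient`. The sketch below runs on a made-up two-style fragment rather than the live page. The name `class_ratings` and the sample CSS are assumptions for illustration, and it assumes the rating percentage is the first one inside the gradient.

```r
library(rvest)

# Made-up fragment: one rating class and one unrelated class that also
# happens to contain a percentage.
page <- minimal_html('
  <style data-emotion-css="18v8tui">.css-18v8tui { background: linear-gradient(90deg, #0caa41 40%, #dee0e3 40%); }</style>
  <style data-emotion-css="abc123">.css-abc123 { width: 100%; }</style>
')

styles <- html_elements(page, "style")
css    <- html_text(styles)

# Capture the first percentage inside linear-gradient(...); styles without
# a gradient produce no match at all.
m   <- regmatches(css, regexec("linear-gradient\\([^)]*?([0-9]+)%", css, perl = TRUE))
pct <- vapply(m, function(x) if (length(x) == 2) as.numeric(x[2]) else NA_real_, numeric(1))

# Name each value after its CSS class and drop the non-rating styles
class_ratings <- setNames(pct / 20, paste0("css-", html_attr(styles, "data-emotion-css")))
class_ratings <- class_ratings[!is.na(class_ratings)]
class_ratings
```

Only `css-18v8tui` survives the filter here; the `width: 100%` class is dropped because it has no gradient.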

    Finally we grab all reviews, each in an element with class gdReview. For each review we grab all star-ratings, each in a li element inside an element with class content. For each star-rating we extract the text label and the CSS class that encodes the number of stars. I don't do anything to export the results, I just print them to the console:

    reviews = pg_reviews %>% html_elements('.gdReview')
    for(re in reviews) {
      
      ratings = re %>% html_elements('.content') %>% html_elements('li')
      for(ra in ratings) {
        
        label = ra %>% html_element('div') %>% html_text()
        classes = ra %>% html_elements('div[font-size="sm"]') %>% html_attr('class')
        class = str_split(classes, ' ')[[1]][1] # take the first class name from the class attribute
        
        cat(label, class.ratings[class], '\n')
        
      }
    
      cat('\n')
      
    }
    

    output:

    Work/Life Balance 5 
    Culture & Values 5 
    Diversity & Inclusion 5 
    Career Opportunities 5 
    ...
    

    Since not every review contains these star-ratings per subcategory, there will be some empty fields.
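
To collect the results in a data frame rather than printing them, something along these lines could work. It is only a sketch on an invented fragment: the lookup `class_ratings`, the class names `css-aaa` and `css-bbb`, and the column names are assumptions, and a review without sub-ratings is kept as a single row of NA values.

```r
library(rvest)

# Invented lookup and page fragment, shaped like the structures described above
class_ratings <- c("css-aaa" = 5, "css-bbb" = 2)

page <- minimal_html('
  <div class="gdReview">
    <div class="content"><ul>
      <li><div>Work/Life Balance</div><div font-size="sm" class="css-aaa x1"></div></li>
      <li><div>Culture &amp; Values</div><div font-size="sm" class="css-bbb x1"></div></li>
    </ul></div>
  </div>
  <div class="gdReview"><p>No sub-ratings here</p></div>
')

reviews <- html_elements(page, ".gdReview")

rows <- lapply(seq_along(reviews), function(i) {
  items <- html_elements(reviews[[i]], ".content li")
  if (length(items) == 0) {
    # Keep reviews without sub-ratings as a single NA row
    return(data.frame(review = i, category = NA_character_, stars = NA_real_))
  }
  label <- html_text(html_element(items, "div"))
  cls   <- html_attr(html_element(items, 'div[font-size="sm"]'), "class")
  first <- sub(" .*", "", cls)  # keep only the first class name
  data.frame(review = i, category = label, stars = unname(class_ratings[first]))
})

ratings_df <- do.call(rbind, rows)
ratings_df
```

The `review` column is just the position of the review on the page, so it can be used to join these rows back onto a per-review data frame such as the one built in the question.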