selenium-webdriver web-scraping selenium-chromedriver

Unable to scrape Category Ratings from Glassdoor


I am scraping reviews from Glassdoor using Python. Everything works for the rating, pros, cons, date, job_title, and employee_type fields, but scraping the per-category ratings fails.

I first wrote an extract_star_rating function, because a category's rating is encoded in its class name: categories with the same rating share the same class names, according to this rule:

If the category's element has the class css-1mfncox e1hd5jg10, it is rated 1 star; if it has css-1lp3h8x e1hd5jg10, 2 stars; and so on.

Here's the extract_star_rating function:

`def extract_star_rating(review, category_name):
    xpath = f'//span[text()="{category_name}"]/ancestor::div[@class="common__EIReviewsRatingsStyles__RatingItemWrapper-sc-1dl5e6p-3 gdGrid"]//div[@class]'
    category_div = review.find_element(By.XPATH, xpath)
    class_name = category_div.get_attribute('class')
    if 'css-1mfncox' in class_name:
        return 1
    elif 'css-1lp3h8x' in class_name:
        return 2
    elif 'css-k58126' in class_name:
        return 3
    elif 'css-94nhxw' in class_name:
        return 4
    else:
        return 5`

Then I call this function six times per review, once for each of the six category columns of the dataframe. I am not sure what to pass as arguments when calling it.

`# loop through all pages
for i in range(1, 3697):
    # visit the page
    page_url = f"{url[:-4]}_P{i}.htm"
    driver.get(page_url)
    # get all of the review elements on the page
    review_elements = driver.find_elements(by=By.XPATH, value="//div[@class='gdReview']")
    # loop through each review element and extract the relevant information
    for element in review_elements:
        review = {}
        review['Work/Life Balance'] = extract_star_rating(element, 'Work/Life Balance')
        review['Culture & Values'] = extract_star_rating(element, 'Culture & Values')
        review['Diversity & Inclusion'] = extract_star_rating(element, 'Diversity & Inclusion')
        review['Career Opportunities'] = extract_star_rating(element, 'Career Opportunities')
        review['Compensation and Benefits'] = extract_star_rating(element, 'Compensation and Benefits')
        review['Senior Management'] = extract_star_rating(element, 'Senior Management')
        reviews.append(review)`

This is the error I get:

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//span[text()="Work/Life Balance"]/ancestor::div[@class="common__EIReviewsRatingsStyles__RatingItemWrapper-sc-1dl5e6p-3 gdGrid"]//div[@


Solution

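  • Make the following changes to your code:

    #1 In the extract_star_rating function, change the XPath to the one below. It matches the category label as a div and, because of the leading ., searches only inside the current review element: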

    xpath = f'.//div[text()="{category_name}"]/following-sibling::div'
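
The leading . is what scopes the lookup: in Selenium, an XPath that starts with // is evaluated against the whole document even when find_element is called on a review element. The effect can be illustrated without a browser. The sketch below uses the standard-library ElementTree instead of Selenium, and the markup is a made-up, simplified stand-in for the real page, so treat it as an illustration of scoping only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: two reviews, each with its own rating.
PAGE = """
<html><body>
  <div class="gdReview"><span class="ratingNumber">4.0</span></div>
  <div class="gdReview"><span class="ratingNumber">2.0</span></div>
</body></html>
"""

root = ET.fromstring(PAGE)
reviews = root.findall(".//div[@class='gdReview']")

# Scoped lookup: each search starts at the review element itself.
scoped = [r.find(".//span[@class='ratingNumber']").text for r in reviews]

# Document-wide lookup: every search starts at the root, so it always
# lands on the first match in the page, whatever review is being processed.
unscoped = [root.find(".//span[@class='ratingNumber']").text for _ in reviews]

print(scoped)    # ['4.0', '2.0']
print(unscoped)  # ['4.0', '4.0']
```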
    

    #2 While looping over the reviews, some rating categories may be missing from a review, so handle that case as well: if a category is not present, set it to "N/A". For example:

    for element in review_elements:
        review = {}
        try:
            review['Work/Life Balance'] = extract_star_rating(element, 'Work/Life Balance')
        except NoSuchElementException:
            review['Work/Life Balance'] = "N/A"
    

    #3 Some reviews have no category ratings at all, only the overall rating. So first check whether category-level ratings are available; if not, return the overall rating given by the user. Updated function:

    def extract_star_rating(reviewElement, category_name):
        # Check whether category-level ratings are available
        try:
            reviewElement.find_element(By.XPATH, ".//aside")
        except NoSuchElementException:
            # No <aside> means no category-level ratings, so return the overall rating
            print("No Category level Rating Info")
            # The leading "." keeps the search inside this review element
            rating = int(float(reviewElement.find_element(By.XPATH, ".//span[contains(@class,'ratingNumber')]").text))
            return rating
        xpath = f'.//div[text()="{category_name}"]/following-sibling::div'

        # Category-level ratings are available: map the class name to a star count
        category_div = reviewElement.find_element(By.XPATH, xpath)
        class_name = category_div.get_attribute('class')
        if 'css-1mfncox' in class_name:
            return 1
        elif 'css-1lp3h8x' in class_name:
            return 2
        elif 'css-k58126' in class_name:
            return 3
        elif 'css-94nhxw' in class_name:
            return 4
        else:
            return 5
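
The class-name checks could also be written as a lookup table, which keeps the class-to-stars mapping in one place if Glassdoor changes the class names. This is only a sketch: STAR_CLASSES and stars_from_class are names introduced here, and the 5-star class is not shown in the code above, so 5 remains the fallback just as in the if/elif chain. Unlike the substring test above, it compares whole class tokens.

```python
# Class names copied from the if/elif chain above. The class used for
# 5 stars is not known here, so 5 is kept as the default (an assumption
# inherited from the original code).
STAR_CLASSES = {
    'css-1mfncox': 1,
    'css-1lp3h8x': 2,
    'css-k58126': 3,
    'css-94nhxw': 4,
}


def stars_from_class(class_name, default=5):
    """Return the star count encoded in a rating element's class attribute."""
    for token in class_name.split():
        if token in STAR_CLASSES:
            return STAR_CLASSES[token]
    return default


print(stars_from_class('css-1mfncox e1hd5jg10'))
```

Inside extract_star_rating, the chain would then reduce to return stars_from_class(class_name).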
    

    Below is the full code, which I tested on a single page. To scrape every page for your URL, wrap it in a for loop over the pages.

    from selenium.webdriver.common.by import By
    import undetected_chromedriver
    from selenium.common import NoSuchElementException
    
    base_url = 'https://www.glassdoor.co.in/Reviews/Cognizant-Technology-Solutions-Reviews-E8014_P3.htm?filter.iso3Language=eng'
    page_count = 442
    
    driver = undetected_chromedriver.Chrome()
    driver.get(base_url)
    # get all of the review elements on the page
    review_elements = driver.find_elements(by=By.XPATH, value="//div[@class='gdReview']")
    
    # loop through each review element and extract the relevant information
    reviews = []
    
    
    def extract_star_rating(reviewElement, category_name):
        # Checking if Rating by Category is available
        try:
            reviewElement.find_element(By.XPATH, ".//aside")
        except NoSuchElementException:
            # Since Exception is thrown here that means Rating by Category is Not available so return total Rating
            print("No Category level Rating Info")
            rating = int(float(reviewElement.find_element(By.XPATH, ".//span[contains(@class,'ratingNumber')]").text))
            return rating
        xpath = f'.//div[text()="{category_name}"]/following-sibling::div'
    
        # Processing as Rating by Category is available
        category_div = reviewElement.find_element(By.XPATH, xpath)
        class_name = category_div.get_attribute('class')
        if 'css-1mfncox' in class_name:
            return 1
        elif 'css-1lp3h8x' in class_name:
            return 2
        elif 'css-k58126' in class_name:
            return 3
        elif 'css-94nhxw' in class_name:
            return 4
        else:
            return 5
    
    
    for element in review_elements:
        review = {}
        try:
            review['Work/Life Balance'] = extract_star_rating(element, 'Work/Life Balance')
        except NoSuchElementException as e:
            review['Work/Life Balance'] = "N/A"
    
        try:
            review['Culture & Values'] = extract_star_rating(element, 'Culture & Values')
        except NoSuchElementException as e:
            review['Culture & Values'] = "N/A"
    
        try:
            review['Diversity & Inclusion'] = extract_star_rating(element, 'Diversity and Inclusion')
        except NoSuchElementException as e:
            review['Diversity & Inclusion'] = "N/A"
    
        try:
            review['Career Opportunities'] = extract_star_rating(element, 'Career Opportunities')
        except NoSuchElementException as e:
            review['Career Opportunities'] = "N/A"
    
        try:
            review['Compensation and Benefits'] = extract_star_rating(element, 'Compensation and Benefits')
        except NoSuchElementException as e:
            review['Compensation and Benefits'] = "N/A"
    
        try:
            review['Senior Management'] = extract_star_rating(element, 'Senior Management')
        except NoSuchElementException as e:
            review['Senior Management'] = "N/A"
    
        reviews.append(review)
    for r in reviews:
        print(r)
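
The six near-identical try/except blocks can also be folded into one loop. The sketch below is one possible way to do that; CATEGORIES, collect_ratings, and the missing_exc parameter are names introduced here for illustration and are not part of the tested code. The extractor and the exception type are passed in as arguments so the helper can be tried without a browser.

```python
# Output column -> label text searched for on the page. The labels are
# the ones used in the full code above.
CATEGORIES = {
    'Work/Life Balance': 'Work/Life Balance',
    'Culture & Values': 'Culture & Values',
    'Diversity & Inclusion': 'Diversity and Inclusion',
    'Career Opportunities': 'Career Opportunities',
    'Compensation and Benefits': 'Compensation and Benefits',
    'Senior Management': 'Senior Management',
}


def collect_ratings(element, extract, missing_exc):
    """Build one review dict, storing "N/A" where a category is missing."""
    review = {}
    for column, label in CATEGORIES.items():
        try:
            review[column] = extract(element, label)
        except missing_exc:
            review[column] = "N/A"
    return review


# Browser-free demonstration: a plain dict stands in for a review element,
# and a missing key plays the role of a missing category.
fake_review = {'Career Opportunities': 1}
print(collect_ratings(fake_review, lambda el, label: el[label], KeyError))
```

With the Selenium code above, the call would be collect_ratings(element, extract_star_rating, NoSuchElementException).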
    

    This extracts all the ratings and prints them:

    {'Work/Life Balance': 2, 'Culture & Values': 4, 'Diversity & Inclusion': 4, 'Career Opportunities': 4, 'Compensation and Benefits': 4, 'Senior Management': 4}
    {'Work/Life Balance': 3, 'Culture & Values': 3, 'Diversity & Inclusion': 3, 'Career Opportunities': 3, 'Compensation and Benefits': 3, 'Senior Management': 3}
    {'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 3, 'Compensation and Benefits': 5, 'Senior Management': 4}
    {'Work/Life Balance': 2, 'Culture & Values': 2, 'Diversity & Inclusion': 5, 'Career Opportunities': 4, 'Compensation and Benefits': 1, 'Senior Management': 2}
    {'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 5, 'Compensation and Benefits': 3, 'Senior Management': 5}
    {'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 5, 'Compensation and Benefits': 4, 'Senior Management': 5}
    {'Work/Life Balance': 2, 'Culture & Values': 2, 'Diversity & Inclusion': 2, 'Career Opportunities': 2, 'Compensation and Benefits': 2, 'Senior Management': 2}
    {'Work/Life Balance': 3, 'Culture & Values': 3, 'Diversity & Inclusion': 3, 'Career Opportunities': 3, 'Compensation and Benefits': 3, 'Senior Management': 3}
    {'Work/Life Balance': 4, 'Culture & Values': 4, 'Diversity & Inclusion': 3, 'Career Opportunities': 3, 'Compensation and Benefits': 4, 'Senior Management': 2}
    {'Work/Life Balance': 2, 'Culture & Values': 3, 'Diversity & Inclusion': 4, 'Career Opportunities': 3, 'Compensation and Benefits': 4, 'Senior Management': 2}
    

    In case some categories are missing, the output looks like this:

    {'Work/Life Balance': 'N/A', 'Culture & Values': 'N/A', 'Diversity & Inclusion': 'N/A', 'Career Opportunities': 1, 'Compensation and Benefits': 'N/A', 'Senior Management': 'N/A'}
    {'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 5, 'Compensation and Benefits': 5, 'Senior Management': 5}
    {'Work/Life Balance': 1, 'Culture & Values': 1, 'Diversity & Inclusion': 1, 'Career Opportunities': 1, 'Compensation and Benefits': 1, 'Senior Management': 1}
    

    Note: other pages may have further rating-related cases that you would need to handle.
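
To extend the single-page example to every page, the page URLs can be derived from one starting URL. The sketch below assumes the _P<n>.htm pattern seen in the URLs in the question and the answer; page_url is a helper name introduced here, and page_count is the value from the full code.

```python
import re


def page_url(url, page):
    """Return url rewritten to point at review page number `page`."""
    # Replace an existing _P<n> suffix (if any) in front of ".htm" with the
    # requested page number. A query string after ".htm" is left untouched.
    return re.sub(r'(?:_P\d+)?\.htm', f'_P{page}.htm', url, count=1)


base_url = 'https://www.glassdoor.co.in/Reviews/Cognizant-Technology-Solutions-Reviews-E8014_P3.htm?filter.iso3Language=eng'
page_count = 442

urls = [page_url(base_url, page) for page in range(1, page_count + 1)]
print(urls[0])
print(len(urls))
```

Each URL in urls would then be passed to driver.get, followed by the per-review loop from the full code.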