I tried scraping reviews from Glassdoor using Python. Everything worked fine for the rating, pros, cons, date, job_title, and employee_type data, but scraping the per-category ratings doesn't work.
I first wrote an extract_star_rating function, because a category's star rating is only encoded in the CSS class of its element, and categories with the same rating share the same class names. The rule is:
if the category's element has the class css-1mfncox e1hd5jg10, it is rated 1 star; else if its class contains css-1lp3h8x, 2 stars; and so on (the full mapping is in the function).
Here's the extract_star_rating function:
`def extract_star_rating(review, category_name):
    xpath = f'//span[text()="{category_name}"]/ancestor::div[@class="common__EIReviewsRatingsStyles__RatingItemWrapper-sc-1dl5e6p-3 gdGrid"]//div[@class]'
    category_div = review.find_element(By.XPATH, xpath)
    class_name = category_div.get_attribute('class')
    if 'css-1mfncox' in class_name:
        return 1
    elif 'css-1lp3h8x' in class_name:
        return 2
    elif 'css-k58126' in class_name:
        return 3
    elif 'css-94nhxw' in class_name:
        return 4
    else:
        return 5`
Then I call this function six times per review, once for each of the six rating columns of the dataframe. But I'm not sure which arguments to pass when calling it.
`# loop through all pages
for i in range(1, 3697):
    # visit the page
    page_url = f"{url[:-4]}_P{i}.htm"
    driver.get(page_url)
    # get all of the review elements on the page
    review_elements = driver.find_elements(by=By.XPATH, value="//div[@class='gdReview']")
    # loop through each review element and extract the relevant information
    for element in review_elements:
        review = {}
        review['Work/Life Balance'] = extract_star_rating(element, 'Work/Life Balance')
        review['Culture & Values'] = extract_star_rating(element, 'Culture & Values')
        review['Diversity & Inclusion'] = extract_star_rating(element, 'Diversity & Inclusion')
        review['Career Opportunities'] = extract_star_rating(element, 'Career Opportunities')
        review['Compensation and Benefits'] = extract_star_rating(element, 'Compensation and Benefits')
        review['Senior Management'] = extract_star_rating(element, 'Senior Management')
        reviews.append(review)`
This is the error I get:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//span[text()="Work/Life Balance"]/ancestor::div[@class="common__EIReviewsRatingsStyles__RatingItemWrapper-sc-1dl5e6p-3 gdGrid"]//div[@
Please make the following changes to your code.
#1 In extract_star_rating, change the XPath to the one below. The leading dot makes the search relative to the review element instead of the whole page:
xpath = f'.//div[text()="{category_name}"]/following-sibling::div'
#2 Some reviews don't have every rating category, so you have to handle that as well. If a category is not present, set it to "N/A", for example:
for element in review_elements:
    try:
        review['Work/Life Balance'] = extract_star_rating(element, 'Work/Life Balance')
    except NoSuchElementException as e:
        review['Work/Life Balance'] = "N/A"
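The same try/except can be written once and applied to every category in a loop. This is only a sketch: collect_ratings, CATEGORIES, and the missing_errors parameter are names made up for illustration, and the labels are copied from the calls in the full code of this answer. The extractor is passed in as an argument so the helper can be tried without a browser.

```python
# Column name in the result -> label text searched for on the page.
# Labels are copied from the calls in this answer; adjust them if the page differs.
CATEGORIES = {
    'Work/Life Balance': 'Work/Life Balance',
    'Culture & Values': 'Culture & Values',
    'Diversity & Inclusion': 'Diversity and Inclusion',
    'Career Opportunities': 'Career Opportunities',
    'Compensation and Benefits': 'Compensation and Benefits',
    'Senior Management': 'Senior Management',
}


def collect_ratings(element, extract, missing_errors=(LookupError,), default="N/A"):
    """Call extract(element, label) for each category.

    Any exception listed in missing_errors is treated as "category not
    present" and replaced by default. (Hypothetical helper, not part of Selenium.)
    """
    review = {}
    for column, label in CATEGORIES.items():
        try:
            review[column] = extract(element, label)
        except missing_errors:
            review[column] = default
    return review
```

With Selenium you would probably call it as collect_ratings(element, extract_star_rating, missing_errors=(NoSuchElementException,)) inside the loop over review_elements.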
#3 Some reviews have no category ratings at all, only the total rating. So first check whether category-level ratings are available; if they are not, return the total rating given by the user. Updated function:
def extract_star_rating(reviewElement, category_name):
    # Check whether ratings by category are available for this review
    try:
        reviewElement.find_element(By.XPATH, ".//aside")
    except NoSuchElementException:
        # No <aside> means there are no category ratings, so return the total rating
        print("No Category level Rating Info")
        # The leading dot keeps the search inside this review instead of the whole page
        rating = int(float(reviewElement.find_element(By.XPATH, ".//span[contains(@class,'ratingNumber')]").text))
        return rating
    xpath = f'.//div[text()="{category_name}"]/following-sibling::div'
    # Ratings by category are available: map the CSS class to a star count
    category_div = reviewElement.find_element(By.XPATH, xpath)
    class_name = category_div.get_attribute('class')
    if 'css-1mfncox' in class_name:
        return 1
    elif 'css-1lp3h8x' in class_name:
        return 2
    elif 'css-k58126' in class_name:
        return 3
    elif 'css-94nhxw' in class_name:
        return 4
    else:
        return 5
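The if/elif chain can also be written as a table lookup, which makes the class-to-stars mapping easier to update when the site changes its class names. A minimal sketch, assuming the four class names from the function are still the ones in use; STAR_CLASSES and stars_from_class are illustrative names. Unlike the substring checks in the function, it compares whole class tokens.

```python
# CSS class token -> star count, taken from the if/elif chain in this answer.
STAR_CLASSES = {
    'css-1mfncox': 1,
    'css-1lp3h8x': 2,
    'css-k58126': 3,
    'css-94nhxw': 4,
}


def stars_from_class(class_name, default=5):
    """Return the star count for a class attribute such as 'css-1mfncox e1hd5jg10'.

    Falls back to default when no known token is found, mirroring the
    final else branch of the original function.
    """
    for token in class_name.split():
        if token in STAR_CLASSES:
            return STAR_CLASSES[token]
    return default
```

Passing default=None instead of 5 would let you notice class names the table does not know, rather than silently counting them as 5 stars.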
Below is the full code, which I have tested on a single page. You can add a for loop around it to scrape all pages for your URL.
from selenium.webdriver.common.by import By
import undetected_chromedriver
from selenium.common.exceptions import NoSuchElementException

base_url = 'https://www.glassdoor.co.in/Reviews/Cognizant-Technology-Solutions-Reviews-E8014_P3.htm?filter.iso3Language=eng'
page_count = 442
driver = undetected_chromedriver.Chrome()
driver.get(base_url)
# get all of the review elements on the page
review_elements = driver.find_elements(by=By.XPATH, value="//div[@class='gdReview']")
# loop through each review element and extract the relevant information
reviews = []

def extract_star_rating(reviewElement, category_name):
    # Check whether ratings by category are available for this review
    try:
        reviewElement.find_element(By.XPATH, ".//aside")
    except NoSuchElementException:
        # No <aside> means there are no category ratings, so return the total rating
        print("No Category level Rating Info")
        # The leading dot keeps the search inside this review instead of the whole page
        rating = int(float(reviewElement.find_element(By.XPATH, ".//span[contains(@class,'ratingNumber')]").text))
        return rating
    xpath = f'.//div[text()="{category_name}"]/following-sibling::div'
    # Ratings by category are available: map the CSS class to a star count
    category_div = reviewElement.find_element(By.XPATH, xpath)
    class_name = category_div.get_attribute('class')
    if 'css-1mfncox' in class_name:
        return 1
    elif 'css-1lp3h8x' in class_name:
        return 2
    elif 'css-k58126' in class_name:
        return 3
    elif 'css-94nhxw' in class_name:
        return 4
    else:
        return 5

for element in review_elements:
    review = {}
    try:
        review['Work/Life Balance'] = extract_star_rating(element, 'Work/Life Balance')
    except NoSuchElementException as e:
        review['Work/Life Balance'] = "N/A"
    try:
        review['Culture & Values'] = extract_star_rating(element, 'Culture & Values')
    except NoSuchElementException as e:
        review['Culture & Values'] = "N/A"
    try:
        review['Diversity & Inclusion'] = extract_star_rating(element, 'Diversity and Inclusion')
    except NoSuchElementException as e:
        review['Diversity & Inclusion'] = "N/A"
    try:
        review['Career Opportunities'] = extract_star_rating(element, 'Career Opportunities')
    except NoSuchElementException as e:
        review['Career Opportunities'] = "N/A"
    try:
        review['Compensation and Benefits'] = extract_star_rating(element, 'Compensation and Benefits')
    except NoSuchElementException as e:
        review['Compensation and Benefits'] = "N/A"
    try:
        review['Senior Management'] = extract_star_rating(element, 'Senior Management')
    except Exception as e:
        review['Senior Management'] = "N/A"
    reviews.append(review)

for r in reviews:
    print(r)
It extracts all the ratings and prints them:
{'Work/Life Balance': 2, 'Culture & Values': 4, 'Diversity & Inclusion': 4, 'Career Opportunities': 4, 'Compensation and Benefits': 4, 'Senior Management': 4}
{'Work/Life Balance': 3, 'Culture & Values': 3, 'Diversity & Inclusion': 3, 'Career Opportunities': 3, 'Compensation and Benefits': 3, 'Senior Management': 3}
{'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 3, 'Compensation and Benefits': 5, 'Senior Management': 4}
{'Work/Life Balance': 2, 'Culture & Values': 2, 'Diversity & Inclusion': 5, 'Career Opportunities': 4, 'Compensation and Benefits': 1, 'Senior Management': 2}
{'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 5, 'Compensation and Benefits': 3, 'Senior Management': 5}
{'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 5, 'Compensation and Benefits': 4, 'Senior Management': 5}
{'Work/Life Balance': 2, 'Culture & Values': 2, 'Diversity & Inclusion': 2, 'Career Opportunities': 2, 'Compensation and Benefits': 2, 'Senior Management': 2}
{'Work/Life Balance': 3, 'Culture & Values': 3, 'Diversity & Inclusion': 3, 'Career Opportunities': 3, 'Compensation and Benefits': 3, 'Senior Management': 3}
{'Work/Life Balance': 4, 'Culture & Values': 4, 'Diversity & Inclusion': 3, 'Career Opportunities': 3, 'Compensation and Benefits': 4, 'Senior Management': 2}
{'Work/Life Balance': 2, 'Culture & Values': 3, 'Diversity & Inclusion': 4, 'Career Opportunities': 3, 'Compensation and Benefits': 4, 'Senior Management': 2}
In case some categories are missing, we get:
{'Work/Life Balance': 'N/A', 'Culture & Values': 'N/A', 'Diversity & Inclusion': 'N/A', 'Career Opportunities': 1, 'Compensation and Benefits': 'N/A', 'Senior Management': 'N/A'}
{'Work/Life Balance': 5, 'Culture & Values': 5, 'Diversity & Inclusion': 5, 'Career Opportunities': 5, 'Compensation and Benefits': 5, 'Senior Management': 5}
{'Work/Life Balance': 1, 'Culture & Values': 1, 'Diversity & Inclusion': 1, 'Career Opportunities': 1, 'Compensation and Benefits': 1, 'Senior Management': 1}
Note: there may still be more cases to handle for ratings on other pages.
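For the loop over all pages, the page URLs can be generated up front. This is a rough sketch that assumes every page follows the `<stem>_P<n>.htm` pattern seen in the question and in base_url, with an optional query string appended; page_urls, stem, and query are illustrative names, and the pattern itself is an assumption about the site.

```python
def page_urls(stem, page_count, query=""):
    """Yield one URL per page, assuming the '<stem>_P<n>.htm' pattern."""
    for page in range(1, page_count + 1):
        yield f"{stem}_P{page}.htm{query}"
```

For the URL in this answer, the stem would be the part before `_P3.htm` and the query would be `?filter.iso3Language=eng`. Inside the loop you would call driver.get(url) and then run the per-review extraction shown above.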