I am trying to automate data entry for a project I'm working on by writing a Selenium script that pulls all of the data from a survey report page. My current issue is that one of the lines of code won't recognize a button in the page's HTML in order to click it. The page follows a mostly consistent pattern: each question is a drop-down menu that expands to show several graphs of disaggregations of the data, and each graph has its own menu button. What I want the script to do is open each question, open each disaggregation's menu button, click "View data table", harvest the data, then repeat for each disaggregation in a question and for all questions. Once all the data is scraped, I'll work on formatting it how I need it and putting it in a CSV.
I haven't finished writing the code yet because I hit a snag early on that I have been trying to fix for days now. In the code below, the click on the disaggregation menu button doesn't work (it either raises an error or just gets ignored altogether). The question drop-down works, but something is wrong with the disaggregations part (disagg_buttons). See below:
URL = "https://secure.panoramaed.com/ride/understand/1302972/survey_results/27746568#/questions"
URL2 = "https://secure.panoramaed.com/ride/understand?auth_token=geZrUH8yRr8_Ln_C9LH3"
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import os
driver=webdriver.Firefox()
driver.get(URL2)
driver.get(URL)
html_text = driver.page_source
# Finds each question in the webpage based on the class "expandable-row". This will identify all of the questions on the page.
questions_buttons = driver.find_elements(By.CLASS_NAME, "expandable-row")
# Opens each drop down menu so that we can see all of the disaggregations of each question.
for question_button in questions_buttons:
question_button.click()
# Finds each disaggregation breakdown button by the class "highcharts-a11y-proxy-button.highcharts-no-tooltip".
disagg_buttons = driver.find_elements(By.CSS_SELECTOR, 'button.highcharts-a11y-proxy-button highcharts-no-tooltip')
for disagg_button in disagg_buttons:
disagg_button.click()
# When the disaggregation button is clicked, this will select the first item in the list which is the "View Data Table" option.
view_datatable_button = driver.find_element(By.CLASS_NAME, "li")
view_datatable_button.click()
rendered_html = driver.find_elements(By.CLASS_NAME, "ng-scope")
for e in rendered_html:
pass
# soup=BeautifulSoup(html_text, "html.parser")
# soup_pretty = soup.prettify()
# print(soup_pretty)
# file = open("output.txt", "a")
# file.write(html_text)
# file.close()
I've tried a couple of different options, including
find_elements(By.CLASS_NAME, "highcharts-a11y-proxy-button highcharts-no-tooltip"),
find_elements(By.CSS_SELECTOR, 'button.highcharts-a11y-proxy-button.highcharts-no-tooltip'),
and find_elements(By.XPATH, "//button[starts-with(@button, 'highcharts-a11y-proxy-button')]"),
but have not had success with any of them.
I'd suggest the following approach. It expands all of the tables for each question, then extracts the information from each table.
import time
from io import StringIO

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://secure.panoramaed.com/ride/understand/1302972/survey_results/27746568#/questions"
URL_AUTH = "https://secure.panoramaed.com/ride/understand?auth_token=geZrUH8yRr8_Ln_C9LH3"
DRIVER_PATH = "/usr/bin/chromedriver"

options = Options()
service = Service(executable_path=DRIVER_PATH)
driver = webdriver.Chrome(service=service, options=options)

# Authenticate.
driver.get(URL_AUTH)
# Open target page.
driver.get(URL)
time.sleep(5)

questions = driver.find_elements(By.CSS_SELECTOR, ".expandable-row")
for question in questions:
    # Open question.
    driver.execute_script("arguments[0].click();", question)
    time.sleep(3)

    # Open context menus.
    menus = driver.find_elements(By.XPATH, "//button[@aria-label='View chart menu, Chart']")
    for menu in menus:
        driver.execute_script("arguments[0].click();", menu)
        time.sleep(1)

    # Get buttons to display tables.
    views = driver.find_elements(By.XPATH, "//li[contains(text(), 'View data table')]")
    for view in views:
        driver.execute_script("arguments[0].click();", view)
        time.sleep(1)

    # Now scrape the contents of the revealed tables.
    tables = driver.find_elements(By.CSS_SELECTOR, ".highcharts-data-table > table")
    for table in tables:
        df = pd.read_html(StringIO(table.get_attribute("outerHTML")))[0]
        print(df)

    # Close question.
    driver.execute_script("arguments[0].click();", question)
    time.sleep(3)

driver.close()
I had trouble with the .click() method on some of the items, so I resorted to clicking via JavaScript instead.
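If you'd rather keep the native .click() where it does work, one option is a small helper that waits for the element to be clickable and only falls back to the JavaScript click on failure. This is a minimal sketch; safe_click is a hypothetical name and the 10-second timeout is an assumption:

from selenium.common.exceptions import ElementClickInterceptedException, TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def safe_click(driver, element, timeout=10):
    # Wait until the element is visible and enabled, then try a native click.
    try:
        WebDriverWait(driver, timeout).until(EC.element_to_be_clickable(element))
        element.click()
    except (ElementClickInterceptedException, TimeoutException):
        # The element is present but obscured or never became clickable;
        # fall back to clicking it via JavaScript.
        driver.execute_script("arguments[0].click();", element)

You could then replace each driver.execute_script("arguments[0].click();", ...) call above with safe_click(driver, ...).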
This is what the output looks like:
Category Responses
0 Not at all excited 1840
1 Slightly excited 2183
2 Somewhat excited 3801
3 Quite excited 1686
4 Extremely excited 726
Category Percentage favorable responses
0 Providence 24
1 Rhode Island 20
Category Providence Rhode Island
0 No 23 19
1 Yes, for part of the day 22 20
2 Yes, for most of the day 33 26
Category Providence Rhode Island
0 Female 21 18
1 Male 26 22
2 Nonbinary 13 15
3 I use another word to describe my gender 28 20
4 I prefer not to answer this question 26 19
Category Providence Rhode Island
0 There is no one in the family or home who is c... 22 20
1 0 days per week 23 18
2 1 or 2 days per week 20 18
3 3 to 5 days per week 31 24
4 6 or 7 days per week 38 28
Obviously, rather than printing the tables to the console you'll want to save them to a file.
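For example, here is a minimal sketch that collects each DataFrame instead of printing it and writes them all out afterwards; the output directory and file-naming scheme are assumptions:

import os

all_tables = []

# Inside the per-question loop above, replace print(df) with:
#     all_tables.append(df)

# After the crawl completes, write each table to its own CSV file.
os.makedirs("output", exist_ok=True)
for i, df in enumerate(all_tables):
    df.to_csv(os.path.join("output", f"table_{i:03d}.csv"), index=False)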
This is the contents of requirements.txt for me:
attrs==23.2.0
beautifulsoup4==4.12.3
bs4==0.0.2
certifi==2024.7.4
h11==0.14.0
idna==3.7
lxml==5.2.2
numpy==2.0.0
outcome==1.3.0.post0
pandas==2.2.2
PySocks==1.7.1
python-dateutil==2.9.0.post0
pytz==2024.1
selenium==4.22.0
six==1.16.0
sniffio==1.3.1
sortedcontainers==2.4.0
soupsieve==2.5
trio==0.26.0
trio-websocket==0.11.1
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
websocket-client==1.8.0
wsproto==1.2.0