javascript, python, selenium, web-scraping, census

Interacting with javascript scrollable container from python/selenium


I am trying to use Selenium/Python to automate downloading datasets from http://factfinder.census.gov. I am new to JavaScript, so apologies if this is an easily resolved problem. I am working on the first part of the code now, and it should:

  1. Go here
  2. Click the "Topics" button
  3. Once "Topics" is clicked and the new page loads, click on "Dataset"
  4. Select the datasets I need, ideally by indexing the (sub) table.

I am stuck at step 3. From inspecting the page, it seems I want to access the div with id "scrollable_container_topics" and then either iterate through or index its child nodes (in this case, I want the last child node). I have tried using execute_script and then locating the element by id and also by class name, but nothing has worked so far. I'd be grateful for any pointers.

Here is my code:

import os
import re
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support.select import Select


# A list of all the variables we want to extract; corresponds to "Topics" field on site
topics = ["B03003", "B05001"]

# A list of all the states we want to extract data for (currently, strings; is there a numeric code?)
states = ["New Jersey", "Georgia"]

# A vector of all the years we want to extract data for [lower, upper) *Note* this != range of years covered by data
years = range(2009, 2010)

# Define the class
class CensusSearch:

    # Initialize and set attributes of the query
    def __init__(self, topic, state, year):

        """
        :type topic: str
        :type state: str
        :type year: int
        """
        self.topic = topic
        self.state = state
        self.year = year


    def setUp(self):

        # self.driver = webdriver.Chrome("C:/Python34/Scripts/chromedriver.exe")
        self.driver = webdriver.Firefox()

    def extractData(self):
        driver = self.driver
        driver.set_page_load_timeout(1000000000000)
        driver.implicitly_wait(100)

        # Navigate to site; this url = after you have already chosen "Advanced Search"
        driver.get("http://factfinder.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t")
        driver.implicitly_wait(10)

        # Filter by dataset (want the ACS 1, 3, and 5-year estimates)

        driver.execute_script("document.getElementsByClassName('leftnav_btn')[0].click()") # click the "Topics" button
        driver.implicitly_wait(20) 

        # This is where I am stuck; I've tried the following: 
        getData = driver.find_element_by_id("ygtvlabelel172")
        getData.click()
        driver.implicitly_wait(10)


        # Filter geographically: select all counties in the United States and Puerto Rico
        # Click "Geographies" button
        driver.execute_script("document.getElementsByClassName('leftnav_btn')[1].click()")
        driver.implicitly_wait(10)

        drop_down = driver.find_element_by_class_name("popular_summarylevel")
        select_box = Select(drop_down)
        select_box.select_by_value("050")

        # Once "Geography" is clicked, select "County - 050" from the drop-down menu; then select "All US + Puerto Rico"
        drop_down_counties = driver.find_element_by_id("geoAssistList")
        select_box_counties = Select(drop_down_counties)
        select_box_counties.select_by_index(1)

        # Click the "ADD TO YOUR SELECTIONS" button
        driver.execute_script("document.getElementsByClassName('button-g')[0].click()")
        driver.implicitly_wait(10)

    def tearDown(self):
        self.driver.quit()

    def main(self):
        #print(getattr(self))
        print(self.state)
        print(self.topic)
        print(self.year)
        self.setUp()
        self.extractData()
        self.tearDown()


for a in topics:
    for b in states:
        for c in years:
            query = CensusSearch(a, b, c)
            query.main()

print("done")

Solution

  • Several things to fix:

    • You don't have to use the document.getElement.. JavaScript methods; Selenium has its own methods for locating elements on a page.
    • There is no need to manipulate implicit waits or page load timeouts in this case. Also note that calling implicitly_wait() does not behave like time.sleep(): it does not pause the script, it only sets how long element lookups keep retrying before they fail. Instead, use Explicit Waits before you perform actions on the page.
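
To illustrate how an explicit wait differs from a fixed sleep, here is a minimal pure-Python sketch of the polling idea. It is not Selenium's actual implementation; `wait_until`, `timeout`, and `poll_interval` are hypothetical names chosen for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` elapses.

    Illustrative only: this mimics the general behaviour of an explicit
    wait, it is not how Selenium implements WebDriverWait.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Return as soon as the condition holds, without waiting
            # for the rest of the timeout.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

Unlike time.sleep(10), which always blocks for the full ten seconds, a wait of this kind returns as soon as the condition is satisfied and only fails once the timeout has passed.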

    Here is working code that clicks "Topics" and then "Dataset":

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC


    driver = webdriver.Firefox()
    driver.get("http://factfinder.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t")

    # explicit wait: polls for up to 10 seconds until a condition is met
    wait = WebDriverWait(driver, 10)

    # click "Topics" (via JavaScript, on the element Selenium located)
    topics = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a#topic-overlay-btn")))
    driver.execute_script("arguments[0].click();", topics)

    # click "Dataset" once it becomes clickable
    dataset = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span[title=Dataset]")))
    dataset.click()
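
For the question's idea of indexing the children of the "scrollable_container_topics" div, a rough sketch along the same lines might look like the following. The helper name `last_child` is hypothetical, and it assumes (unverified against the live page) that the items of interest are direct children of that container. It expects a WebDriverWait-like object, whose until() keeps calling the given function with the driver until the result is truthy.

```python
def last_child(wait, container_id):
    """Return the last direct child element of the container with this id.

    `wait` is assumed to behave like Selenium's WebDriverWait: its
    until(fn) calls fn(driver) repeatedly until the result is truthy.
    """
    def find_children(driver):
        # "id" and "xpath" are the string values of By.ID and By.XPATH.
        container = driver.find_element("id", container_id)
        # "./*" selects only the direct element children of the container.
        return container.find_elements("xpath", "./*")

    # An empty list is falsy, so until() keeps polling until the
    # container holds at least one child (or the wait times out).
    children = wait.until(find_children)
    return children[-1]
```

With the `wait` object from the code above, a call such as last_child(wait, "scrollable_container_topics") would return the last child, and children[i] could be used in the same way to pick any other entry by index.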