Tags: python, python-3.x, selenium, selenium-chromedriver, pageload

How to load in the entirety of a website for selenium to collect data from, and keep everything loaded in?


I am trying to scrape the terms and definitions from this Quizlet set, using the Selenium Chrome driver in Python: https://quizlet.com/433328443/ap-us-history-flash-cards/. There are 533 terms, so many that Quizlet makes you click a "See more" button to see them all. The following code successfully extracts terms and definitions (I have tested it on other Quizlet sets with fewer terms). It also has if statements to deal with popups and the "See more" button. My goal is to get every term-definition pair on the page; to do this, the entire page needs to be loaded, which is the basis of my problem.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# chrome_driver_path holds the local path to the chromedriver executable (defined elsewhere)
driver = webdriver.Chrome(executable_path = chrome_driver_path)
driver.get("https://quizlet.com/433328443/ap-us-history-flash-cards/")
# IN CASE OF POPUP, CLICK AWAY
if len(driver.find_elements_by_xpath("//button[@class='UILink UILink--revert']")) > 0:
    popup = driver.find_element_by_xpath("//button[@class='UILink UILink--revert']")
    popup.click()
    del popup

# SCROLL TO BOTTOM TO LOAD IN ALL TERMS
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# IN CASE OF "SEE MORE" BUTTON AT BOTTOM, CLICK IT
if len(driver.find_elements_by_xpath("//button[@class='UIButton UIButton--fill' and @aria-label='See more']")) > 0:
    see_more = driver.find_element_by_xpath("//button[@class='UIButton UIButton--fill' and @aria-label='See more']")
    see_more.click()
    del see_more
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# tuple of terms
quizlet_terms = tuple(map(lambda a: a.text,
                          driver.find_elements_by_class_name("SetPageTerm-wordText")))

# tuple of definitions
quizlet_definitions = tuple(map(lambda a: a.text,
                                driver.find_elements_by_class_name("SetPageTerm-definitionText")))

In my code, I tried scrolling to the bottom to load everything, but this does not work: as I scroll down, the terms inside my browser window are loaded while the terms above and below the window are unloaded. Presumably this is done for memory reasons, but I do not care about memory; I just want all the terms loaded at once so I can access their contents. My code works on smaller Quizlet sets (say, 100 terms), but it breaks on this one with the following error:

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document

This Stack Overflow question explains the error message: Python with Selenium "element is not attached to the page document".

From reading that page, I have concluded that because the set is so large, only the terms currently in my browser window stay loaded as I scroll down the Quizlet page. Terms I have scrolled past are unloaded and apparently no longer attached to the page document, so I cannot access them and get the error above.
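To make that failure mode concrete, here is a toy model in plain Python. It is not Selenium and not Quizlet's actual implementation; `VirtualList`, `Handle`, and `StaleHandleError` are hypothetical names invented for this illustration. The list keeps only a window of rows "attached", and a handle obtained before scrolling raises once its row has left the window, much as a `WebElement` reference goes stale.

```python
class StaleHandleError(Exception):
    """Toy stand-in for Selenium's StaleElementReferenceException."""


class Handle:
    """Reference to one row; readable only while that row is attached."""

    def __init__(self, owner, index):
        self.owner = owner
        self.index = index

    @property
    def text(self):
        if not self.owner.is_attached(self.index):
            raise StaleHandleError("row is no longer attached to the list")
        return self.owner.rows[self.index]


class VirtualList:
    """Toy windowed list: only rows inside the current window are attached."""

    def __init__(self, rows, window_size):
        self.rows = list(rows)
        self.window_size = window_size
        self.top = 0  # index of the first attached row

    def is_attached(self, index):
        return self.top <= index < self.top + self.window_size

    def scroll_to(self, top):
        self.top = top

    def visible_handles(self):
        end = min(self.top + self.window_size, len(self.rows))
        return [Handle(self, i) for i in range(self.top, end)]


page = VirtualList([f"term {i}" for i in range(10)], window_size=3)

handles = page.visible_handles()        # rows 0..2 are attached
early_text = [h.text for h in handles]  # copy the text while it is readable

page.scroll_to(5)                       # rows 5..7 are attached now

try:
    handles[0].text                     # old handle, row 0 has been detached
    stale_raised = False
except StaleHandleError:
    stale_raised = True

late_text = [h.text for h in page.visible_handles()]
```

The point of the sketch is that text copied out before scrolling (`early_text`) survives, while the handles themselves do not, which suggests reading `.text` immediately after each scroll step rather than holding on to element references.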

How can I keep the entire page loaded so that I can access the contents of all 533 terms? Ideally, I would like a solution that keeps everything I have scrolled past fully loaded and does not unload anything. Alternatively, the whole page could be loaded from the start. A memory-saving solution would also be nice, perhaps one that accesses just the raw HTML without any graphics. Has anyone encountered this problem, and if so, how did you solve it? Thank you, any help is appreciated.


Solution

  • Many thanks to @Abhishek Dhoundiyal for the comment. My working code:

    from re import sub

    import numpy

    # driver is the Chrome driver from the question;
    # remove_whitespace is a helper that is not shown in this post

    # scroll to offset (800, 800) before reading the heading
    driver.execute_script("window.scrollTo(800, 800);")
    # total number of terms: strip every non-digit from the heading text
    terms_in_this_set = int(sub(r"\D", "", driver.find_element_by_xpath("//h4[@class='UIHeading UIHeading--assembly UIHeading--four']").text))
    chunk_size = 15000  # pixels scrolled per iteration

    quizlet = numpy.empty(shape = (0, 2), dtype = "str")

    # done in while loop so that terms and definitions can be extracted while scrolling (while making sure there are no duplicate entries)
    while len(quizlet) != terms_in_this_set:

        # IN CASE OF "SEE MORE" BUTTON, CLICK IT TO SEE MORE
        see_more_buttons = driver.find_elements_by_xpath("//button[@class='UIButton UIButton--fill' and @aria-label='See more']")
        if see_more_buttons:
            see_more_buttons[0].click()

        # CHECK IF THERE ARE TERMS
        quizlet_terms_classes = driver.find_elements_by_class_name("SetPageTerm-wordText")
        quizlet_definitions_classes = driver.find_elements_by_class_name("SetPageTerm-definitionText")
        if quizlet_terms_classes and quizlet_definitions_classes:

            # copy the text of the currently loaded terms and definitions
            terms = [remove_whitespace(term.text) for term in quizlet_terms_classes]
            definitions = [remove_whitespace(definition.text) for definition in quizlet_definitions_classes]
            # append current iteration terms and definitions to full quizlet terms and definitions
            quizlet = numpy.vstack((quizlet, numpy.transpose([terms, definitions])))
            # get unique rows
            quizlet = numpy.unique(quizlet, axis = 0)

        # scroll down by one chunk so the next batch of terms is loaded
        driver.execute_script(f"window.scrollBy(0, {chunk_size})")
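Two properties of the loop above are worth keeping in mind: it exits only when the number of unique rows equals the count in the heading, so a set containing a repeated term-definition pair would never reach that count, and `numpy.unique` returns the rows sorted rather than in page order. The following is a minimal sketch of one possible variant, not the code from the answer. `collect_pairs`, `read_visible`, `scroll`, `max_idle_scrolls`, and `FakePage` are hypothetical names, and the `remove_whitespace` shown here is only a guess at what the unshown helper does. A fake page stands in for the browser so the sketch runs without Selenium.

```python
import re


def remove_whitespace(text):
    """Hypothetical stand-in for the unshown helper: collapse whitespace runs."""
    return re.sub(r"\s+", " ", text).strip()


def collect_pairs(read_visible, scroll, expected=None, max_idle_scrolls=3):
    """Accumulate (term, definition) pairs while scrolling.

    read_visible() returns the pairs currently loaded; scroll() advances the page.
    Stops when `expected` pairs are collected or when `max_idle_scrolls`
    consecutive reads bring nothing new.
    """
    seen = {}  # dict keys keep first-seen order, so page order is preserved
    idle = 0
    while idle < max_idle_scrolls:
        before = len(seen)
        for term, definition in read_visible():
            key = (remove_whitespace(term), remove_whitespace(definition))
            seen.setdefault(key, None)
        if expected is not None and len(seen) >= expected:
            break
        idle = idle + 1 if len(seen) == before else 0
        scroll()
    return list(seen)


class FakePage:
    """Toy page exposing only a sliding window of its pairs (assumption for the demo)."""

    def __init__(self, pairs, window_size):
        self.pairs = pairs
        self.window_size = window_size
        self.top = 0

    def read_visible(self):
        return self.pairs[self.top:self.top + self.window_size]

    def scroll(self):
        last_top = max(0, len(self.pairs) - self.window_size)
        self.top = min(self.top + self.window_size - 1, last_top)


# Six entries, one of them a repeated pair, so only five are unique.
all_pairs = [("a  ", "1"), ("b", "2"), ("c", "3"), ("d", "4"), ("b", "2"), ("e", "5")]
page = FakePage(all_pairs, window_size=3)

# expected=6 can never be reached; the idle guard ends the loop instead.
result = collect_pairs(page.read_visible, page.scroll, expected=6)
```

With a real driver, `read_visible` would wrap the two `find_elements_by_class_name` calls and return the `.text` values right away, and `scroll` would wrap the `window.scrollBy` call. Note that the guard trades completeness for termination: if the page loads slowly, a few idle reads could end the loop early, so a short wait inside `scroll` may be needed.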