javascript, python, selenium-webdriver, web-scraping, selenium-chromedriver

Scraping a Website with Embedded JavaScript Using Selenium


I am new to Selenium and am trying to scrape the contents of this website. However, the site seems to be built from a template that JavaScript populates at runtime, and I don't know how to use Selenium to access the content I see, such as the title (Auf dem Bahnhof) or the Objective.

I can locate the tags of the elements I need in the browser's Web Developer Tools, but my sample script below returns nothing for them:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import Select,WebDriverWait


class Demo():

    def demo_get_contents(self):

        # create webdriver object
        service = Service(executable_path=ChromeDriverManager().install())
        driver = webdriver.Chrome(service=service)

        driver.get('https://gloss.dliflc.edu/LessonViewer.aspx?lessonId=26143&lessonName=ger_soc434&linkTypeId=0')
        element = WebDriverWait(driver, 2).until(EC.visibility_of_all_elements_located((By.CLASS_NAME,'gloss_Overview')))
        print(element.get_attribute('text'))


demo = Demo()
demo.demo_get_contents()

I am using Python 3.8.

Looking at the page source, I can see the JavaScript and the iframe that presumably runs the accessActivity() function, but I don't know how to run it with Selenium to access the actual page contents.


Solution

  • As an alternative, you don't need Selenium at all. If you inspect the network requests in the browser's developer tools, you'll see that the data is available as an XML file from

    https://gloss.dliflc.edu/GlossHtml/templates/linksLO/glossLOs/ger_soc434.xml
    

    You can use Python's built-in ElementTree library to parse that file and extract the lesson data.

    import requests
    import xml.etree.ElementTree as ET
    
    
    url = 'https://gloss.dliflc.edu/GlossHtml/templates/linksLO/glossLOs/ger_soc434.xml'
    
    
    def get_element_text(element):
        return ''.join(element.itertext()).strip()
    
    
    def find_elements_texts(root, tag):
        elements = root.findall(f".//{tag}[@dir='ltr'][@esbox='0']")
        return [get_element_text(elem) for elem in elements]
    
    
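    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    root = ET.fromstring(response.content)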
    
    objectives_texts = find_elements_texts(root, "OBJECTIVES")
    descriptions_texts = find_elements_texts(root, "ACTY_DESCRIPTION")
    
    print(f"Objective:\n {''.join(objectives_texts)}\n")
    
    print(f"Descriptions:\n {descriptions_texts}")
    

    Prints:

    Objective:
     Strengthen listening skills and improve comprehension by focusing on terms related to train travel in an audio about a family at a train station before a trip.
    
    Descriptions:
     ['Identify relevant vocabulary and get a more detailed idea of the topic.', 'Preview useful terms and expressions that appear in the upcoming dialogue.', 'Become familiar with the specifics of the situation by listening to several dialogues.', 'Transcribe portions of another dialogue.', 'Assess your knowledge by matching questions with answers.']
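
The same approach can be sketched offline, without hitting the site. In the example below, xml_url_from_viewer_url and extract_texts are hypothetical helper names, and SAMPLE_XML is a hand-written stand-in, not the real file. The URL helper also assumes that the XML file name always matches the lessonName query parameter, which is only observed for this one lesson and may not hold for others.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlparse

# Base path taken from the XML URL quoted in the answer above.
XML_BASE = "https://gloss.dliflc.edu/GlossHtml/templates/linksLO/glossLOs/"


def xml_url_from_viewer_url(viewer_url):
    """Build the XML URL from the lessonName query parameter (assumed pattern)."""
    query = parse_qs(urlparse(viewer_url).query)
    lesson_name = query["lessonName"][0]
    return f"{XML_BASE}{lesson_name}.xml"


def extract_texts(xml_bytes, tag):
    """Return the stripped text of matching <tag> elements, nested text included."""
    root = ET.fromstring(xml_bytes)
    # Same attribute filter as in the answer: only dir='ltr' and esbox='0'.
    elements = root.findall(f".//{tag}[@dir='ltr'][@esbox='0']")
    return ["".join(elem.itertext()).strip() for elem in elements]


# Hypothetical sample for illustration; the real file's layout may differ.
SAMPLE_XML = b"""<LESSON>
  <OBJECTIVES dir="ltr" esbox="0">Strengthen <b>listening</b> skills.</OBJECTIVES>
  <OBJECTIVES dir="rtl" esbox="0">Filtered out by the dir attribute.</OBJECTIVES>
  <ACTIVITY>
    <ACTY_DESCRIPTION dir="ltr" esbox="0">Identify vocabulary.</ACTY_DESCRIPTION>
  </ACTIVITY>
  <ACTIVITY>
    <ACTY_DESCRIPTION dir="ltr" esbox="0">Preview terms.</ACTY_DESCRIPTION>
  </ACTIVITY>
</LESSON>"""

viewer_url = (
    "https://gloss.dliflc.edu/LessonViewer.aspx"
    "?lessonId=26143&lessonName=ger_soc434&linkTypeId=0"
)

print(xml_url_from_viewer_url(viewer_url))
print(extract_texts(SAMPLE_XML, "OBJECTIVES"))
print(extract_texts(SAMPLE_XML, "ACTY_DESCRIPTION"))
```

To run this against the live site, pass the bytes returned by requests.get(...).content to extract_texts in place of SAMPLE_XML.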