python, html, selenium-webdriver, web-scraping

Scraping a hierarchical website in a specific category


I am trying to scrape the following page: https://esco.ec.europa.eu/en/classification/skill_main. In particular, I would like to click all the plus buttons under S-skills until there are no more plus buttons left to click, and then save the page source. Having found by inspecting the page that the plus button matches the CSS selector ".api_hierarchy.has-child-link", I tried the following:

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import StaleElementReferenceException
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://esco.ec.europa.eu/en/classification/skill_main")
driver.implicitly_wait(10)

wait = WebDriverWait(driver, 20)

# Define a function to click all expandable "+" buttons
def click_expand_buttons():
    while True:
        try:
            # Find all expandable "+" buttons
            expand_buttons = wait.until(EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, ".api_hierarchy.has-child-link"))
            )

            # If no expandable buttons are found, we are done
            if not expand_buttons:
                break

            # Click each expandable "+" button
            for button in expand_buttons:
                try:
                    driver.implicitly_wait(10)
                    driver.execute_script("arguments[0].click();", button)
                    # Wait for the dynamic content to load
                    time.sleep(1)
                except StaleElementReferenceException:
                    # If the element is stale, we find the elements again
                    break
        except StaleElementReferenceException:
            continue

# Call the function to start clicking "+" buttons
click_expand_buttons()

html_source = driver.page_source

# Save the HTML to a file
with open("/Users/federiconutarelli/Desktop/escodata/expanded_esco_skills_page.html", "w", encoding="utf-8") as file:
    file.write(html_source)

# Close the browser
driver.quit()

However, the code above keeps opening and closing the plus buttons of the first level. This is likely because, with my limited knowledge of scraping, I simply asked Selenium to click the plus buttons for as long as plus buttons exist, so when the page reverts to its original state the script keeps clicking them forever. My question is: how can I expand all the plus signs (until there are none left) only for S-skills:

<a href="#overlayspin" class="change_right_content" data-version="ESCO dataset - v1.1.2" data-link="http://data.europa.eu/esco/skill/335228d2-297d-4e0e-a6ee-bc6a8dc110d9" data-id="84527">S - skills</a>

?


Solution

  • I think this will help you; I haven't tested it, but you put effort into your own code.

    I know XPath better, so I changed the CSS selector to an XPath expression.

    The rest of the code should stay the same and still work (see the sketch after the snippet below for one way to wire it in).

    # Find all expandable "+" buttons
    expand_buttons = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, "//div[@class='main_item classification_item' and ./a[text()='S - skills']]//span[@class='api_hierarchy has-child-link']"))
        )
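
    Below is a minimal sketch of how that XPath locator could replace the CSS selector in your loop. It also tries to avoid the endless open/close toggling by remembering which buttons have already been clicked; the data-id key used here is an assumption based on the anchor markup shown in the question, so swap in whatever stable attribute the spans actually carry.

    import time

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import (
        StaleElementReferenceException,
        TimeoutException,
    )

    # XPath from above: "+" buttons that live under the "S - skills" branch only.
    S_SKILLS_BUTTONS = (
        "//div[@class='main_item classification_item' and ./a[text()='S - skills']]"
        "//span[@class='api_hierarchy has-child-link']"
    )

    def expand_s_skills(driver, wait, pause=1):
        clicked = set()  # keys of buttons we have already expanded
        while True:
            try:
                buttons = wait.until(EC.presence_of_all_elements_located(
                    (By.XPATH, S_SKILLS_BUTTONS)))
            except TimeoutException:
                break  # no matching "+" buttons appeared within the wait
            new_clicks = 0
            for button in buttons:
                try:
                    # data-id is an assumption; fall back to Selenium's internal
                    # element id if the spans do not expose such an attribute.
                    key = button.get_attribute("data-id") or button.id
                    if key in clicked:
                        continue  # skip nodes expanded in an earlier pass
                    driver.execute_script("arguments[0].click();", button)
                    clicked.add(key)
                    new_clicks += 1
                    time.sleep(pause)  # give the child nodes time to load
                except StaleElementReferenceException:
                    break  # the DOM changed under us; re-find the buttons
            if new_clicks == 0:
                break  # a full pass clicked nothing new, so the branch is fully open

    # Usage with the driver/wait objects from your question:
    #   expand_s_skills(driver, wait)
    #   html_source = driver.page_source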