Web scraping with selenium fails to paginate


I am trying to scrape https://mst.dk/publikationer. The page is paginated, and judging by the source, the pagination is driven by the markup below.

<div class="Container_Container__G5vVd Container_Container___width_std__y2_Pn">
    <div class="Pagination_Pagination_wrapper__kp62j">
        <ul class="Pagination_Pagination__UOZ60" role="navigation" aria-label="Pagination">
            <li class="Pagination_Pagination_prev__zIUqn Pagination_Pagination_item___disabled__g5CaR">
                <a class="Pagination_Pagination_link__Z2LW0 Pagination_Pagination_prevLink__HDKS4" tabindex="-1" role="button" aria-disabled="true" aria-label="Previous page" rel="prev"></a>
            </li>
            <li class="Pagination_Pagination_item__suqyV selected">
                <a rel="canonical" role="button" class="Pagination_Pagination_link__Z2LW0 Pagination_Pagination_link___active__to_Os" tabindex="-1" aria-label="Side 1" aria-current="page">1</a>
            </li>
            <li class="Pagination_Pagination_item__suqyV">
                <a role="button" class="Pagination_Pagination_link__Z2LW0" tabindex="0" aria-label="Side 2" rel="next">2</a>
            </li>
            <li class="Pagination_Pagination_break__dKVzB">
                <a class="Pagination_Pagination_breakLink__jB8Rd" role="button" tabindex="0">...</a>
            </li>
            <li class="Pagination_Pagination_item__suqyV">
                <a role="button" class="Pagination_Pagination_link__Z2LW0" tabindex="0" aria-label="Side 321">321</a>
            </li>
            <li class="Pagination_Pagination_next__N6tkt">
                <a class="Pagination_Pagination_link__Z2LW0 Pagination_Pagination_nextLink__mytrA" tabindex="0" role="button" aria-disabled="false" aria-label="Next page" rel="next"></a>
            </li>
        </ul>
    </div>

I've tried multiple approaches: adding page=x to the URL, using different Selenium locators and selectors, increasing the wait time, clicking the next button, and simulating a click on the list items. Nothing seems to be working for me. Can anybody please help me understand the dynamics of this page and how to paginate through it? What I am trying to do is open each link on each page, find the PDF, and download it. This works fine for the first page, using the code below:

def parse_epa_filtered_keywords():
    # Get number of search results
    page_no = int(int(get_number_of_results(link_filtered)) / 10) + 1
    driver = webdriver.Chrome(options=options)
    search_query = '+'.join(keywords.split())
    
    for i in tqdm(range(1, page_no + 1)):
        try:
            search_url = f"{link_filtered}?search={search_query}&page={i}"
            print(f"Fetching URL: {search_url}")
            
            # Load the search URL
            driver.get(search_url)
            
            # Wait for the page to load completely
            time.sleep(5)  # Adjust the sleep time as needed
            
            # Wait for the main page to load again
            publications = driver.find_elements(By.CSS_SELECTOR, 'a[class^="Link_Link__lzynb SearchResultItem_SearchResult"]')
            ....
driver.quit()

This is the attempt that uses the page parameter, and it keeps opening the first page over and over. Then I tried to locate the next button with the following:

next_button = driver.find_element(By.XPATH, "//li[contains(@class, 'Pagination_Pagination_next')]/a[@rel='next']")

or

next_button = WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "li.Pagination_Pagination_next_N6tkt a")))

and many more attempts with different elements, which led either to a general ChromeDriver error or to something like:

An error occurred: Message: element click intercepted: Element is not clickable at point (732, 2911)
  (Session info: chrome=128.0.6613.114)
Stacktrace:
0   chromedriver                        0x0000000104f83998 cxxbridge1$str$ptr + 1887096
1   chromedriver                        0x0000000104f7be00 cxxbridge1$str$ptr + 1855456
2   chromedriver                        0x0000000104b80be0 cxxbridge1$string$len + 89508
3   chromedriver                        0x0000000104bca6fc cxxbridge1$string$len + 391360
4   chromedriver                        0x0000000104bc8d28 cxxbridge1$string$len + 384748
5   chromedriver        

Solution

  • next_button = driver.find_element(By.XPATH, "//li[contains(@class, 'Pagination_Pagination_next')]/a[@rel='next']")
    

    Although the XPath expression in your code above is correct, a plain click on the element does not go through. I used ActionChains as below, and it clicked the next button successfully.

    next_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@aria-label='Next page']")))
    actions = ActionChains(driver)
    actions.move_to_element(next_button).click().perform()
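
The "element click intercepted" error quoted in the question typically means that another element (for example a cookie banner or a sticky bar) would receive the click at that point. If ActionChains does not help in your environment, one alternative is to scroll the link into the middle of the viewport and click it through JavaScript, the same way the cookie button is clicked with execute_script in the full script. The sketch below is only an illustration: js_click is a hypothetical helper name, and it assumes a Selenium driver and an element that has already been located.

```python
def js_click(driver, element):
    """Scroll `element` into the middle of the viewport and click it via JavaScript.

    `driver` is expected to be a Selenium WebDriver and `element` a WebElement
    that has already been located. The click is dispatched directly on the
    element, so an overlay covering it does not intercept the click.
    """
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", element)
    driver.execute_script("arguments[0].click();", element)


# Usage (assumes `driver`, `wait`, `EC` and `By` from the full script in this answer):
# next_button = wait.until(EC.presence_of_element_located((By.XPATH, "//a[@aria-label='Next page']")))
# js_click(driver, next_button)
```

A JavaScript click bypasses the checks Selenium performs for a real user click, so it is best kept as a fallback rather than the default.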
    

    Here is the full working code, which scrapes the pages in a loop.

    Note: This example scrapes the search result headings from the first 3 pages; you can adapt it to scrape whatever you need:

    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver import ActionChains
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait
    
    def click_next_page(driver, wait):
        # Wait until the "Next page" link is clickable, then move to it and click
        next_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@aria-label='Next page']")))
        ActionChains(driver).move_to_element(next_button).click().perform()
    
    def extract_headings(wait):
        # Collect the text of every search result heading on the current page
        headings = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//li//h3")))
        return "".join("\n" + heading.text for heading in headings)
    
    driver = webdriver.Chrome()
    driver.get("https://mst.dk/publikationer")
    driver.maximize_window()
    wait = WebDriverWait(driver, 10)
    
    # Accept the cookie pop-up if it appears; carry on if it does not
    try:
        accept_all = wait.until(EC.element_to_be_clickable((By.ID, "CybotCookiebotDialogBodyLevelButtonLevelOptinAllowAll")))
        driver.execute_script("arguments[0].click();", accept_all)
    except TimeoutException:
        pass
    
    search_results_headings = ""
    # The loop runs 3 times, so 3 pages are scraped; change the range to scrape more pages
    for _ in range(3):
        search_results_headings += extract_headings(wait)
        click_next_page(driver, wait)
    
    print(search_results_headings)
    driver.quit()
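
One caveat with the loop above: right after the click, the previous page's headings may still be in the DOM for a moment, so the next extract_headings call could read the same page again. A possible guard is to wait until the pagination marks the expected page as current. The sketch below relies on the aria-current="page" attribute visible in the question's HTML and assumes the site moves that attribute to the new page link after navigation, which is not verified here. page_is_active is a hypothetical helper, not part of Selenium.

```python
def page_is_active(expected_page):
    """Build a condition for WebDriverWait.until that becomes true once the
    pagination link marked aria-current="page" shows `expected_page`."""
    def condition(driver):
        # "css selector" is the string value behind Selenium's By.CSS_SELECTOR
        links = driver.find_elements("css selector", "a[aria-current='page']")
        return any(link.text.strip() == str(expected_page) for link in links)
    return condition


# Usage, after clicking "Next page" on page 1 (assumes `wait` from the script above):
# wait.until(page_is_active(2))
```

WebDriverWait.until accepts any callable that takes the driver, so no extra expected-condition class is needed for this check.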
    

    Console output:

    Diffus forurening med PFAS i jord, grundvand og overfladevand
    Digitale værktøjer til klimatilpasning
    Performancebenchmarking
    Oprensning af PFAS-forurening i jord, slam og vand - Test af teknologier i praksis
    Lokalt funderede analyse – afrapportering
    Maritime Emissionsløsninger i Kystnære Farvande
    Biokinetisk lattergasreduktion i renseanlæg
    Inter DAN NRW
    Gennemførelse og anvendelse af slamdirektivet 2023
    CombiControl - Combining above- and belowground biological control agents for improved pest control in strawberry tunnel production
    Affaldsstatistik 2022
    Scientific investigation of ballast water discharge - Random checks on ships in autumn – winter 2022
    Control of Biocides 2023
    Ny kosteffektiv teknologi til måling af klimagasudledninger fra renseanlæg
    Recycling potential of separately collected post-consumer textile waste
    Modelling and mapping pesticide exposure risk at the catchment scale (MOMAPEST)
    Indberetning af status for anvendelse af almene vandforsyningsboringer i Virk.dk
    PFAS i jord - International screening af andre landes praksis for håndtering af jord med PFAS
    Anbefalinger til screening og kortlægning af bygge- og anlægsaffald
    Emissions of Quaternary Alkylammonium Compounds
    Nikotinposer – indhold og miljøkonsekvenser
    Udredningsprojekt vedr. analysemetoder til undersøgelse for PFAS-forbindelser i jord, grundvand og overfladevand
    Rensningsmuligheder for pesticider med fokus på aktivt kul og membraner
    Renholds- og omkostningsanalyse jf. Engangsplastdirektivets oprydningsansvar
    Kemiske stoffer i en cirkulær økonomi - Et MUDP projekt
    Pesticider og biocider i den danske pindsvinebestand
    Kortlægning af madaffald i primærproduktionen samt forarbejdnings- og fremstillingssektoren for 2022
    Kortlægning af madaffald og madspild i restaurationsbranchen og restaurationstjenester for 2022
    Inhibition of lung surfactant function as an alternative method to predict lung toxicity following exposure to plant protection products
    Survey and risk assessment of pesticides in cut flowers from non-EU countries
    
    Process finished with exit code 0
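
To go through all pages rather than a fixed number, the loop needs a stop condition. In the question's HTML, the disabled "Previous page" link on page 1 carries aria-disabled="true", while the "Next page" link has aria-disabled="false". Assuming the "Next page" link is marked the same way once the last page is reached (this is an assumption, not something checked against the site), a small check like the hypothetical has_next_page below could end the loop.

```python
def has_next_page(next_link):
    """Return True unless the "Next page" link is marked aria-disabled="true".

    `next_link` is expected to be the Selenium WebElement matching
    //a[@aria-label='Next page']; only its get_attribute method is used.
    """
    return next_link.get_attribute("aria-disabled") != "true"


# Usage inside the scraping loop (assumes `driver` and `By` from the script above):
# next_link = driver.find_element(By.XPATH, "//a[@aria-label='Next page']")
# if not has_next_page(next_link):
#     break  # last page reached, stop clicking
```

With this check the loop can become a `while True:` that scrapes the current page, tests has_next_page, and only then clicks the next button.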