Tags: python, selenium, selenium-webdriver, beautifulsoup, web-crawler

Page not scrolling down on a website with dynamic scrolling


I'm scraping news articles from a website. On a specific category page there is no load-more button; the article links are generated as I scroll down. I wrote a function that takes a category_page_url and a limit_loading (how many times I want to scroll down) and returns all the links of the news articles displayed on that page.

Category page link = https://www.scmp.com/topics/trade

import time

from bs4 import BeautifulSoup
from selenium import webdriver


def get_article_links(url, limit_loading):

    options = webdriver.ChromeOptions()
    options.add_argument("--window-size=1920,1080")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-notifications")
    options.add_argument("--disable-popup-blocking")

    driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe", options=options)  # add your chromedriver path

    
    driver.get(url)
    last_height = driver.execute_script("return document.body.scrollHeight")
    
    loading = 0
    while loading < limit_loading:
        loading += 1
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(8)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
        
        
    article_links = []
    bsObj = BeautifulSoup(driver.page_source, 'html.parser')
    container = bsObj.find('div', {'class': 'content-box'}).find('div', {'class': 'topic-article-container'})
    for h2 in container.find_all('h2', {'class': 'article__title'}):
        article_links.append(h2.a['href'])
    
    return article_links

Assuming I want to scroll 5 times on this category page:

get_article_links('https://www.scmp.com/topics/trade', 5)

But even if I change limit_loading, it returns only the links from the first page, so there must be some mistake in how I wrote the scrolling part. Please help me with this.


Solution

  • Instead of scrolling using the body scrollHeight property, I checked whether there was an appropriate element after the list of articles to scroll to, and noticed this aptly named div:

    <div class="topic-content__load-more-anchor" data-v-db98a5c0=""></div>
    

    Accordingly, I mainly changed the while loop in your function get_article_links: find this div once before the loop starts, then scroll to it on each pass using its location_once_scrolled_into_view property, as follows (the full revised function appears after the snippet):

        loading = 0
        end_div = driver.find_element('class name', 'topic-content__load-more-anchor')
        while loading < limit_loading:
            loading += 1
            print(f'scrolling to page {loading}...')
            end_div.location_once_scrolled_into_view  # reading this property scrolls the div into view
            time.sleep(2)  # give the next batch of articles time to load
            
    

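    For reference, here is the whole function with this loop in place: a minimal sketch that keeps the rest of your code as it was (only the driver.quit() at the end is my addition, so each run closes its browser):

        import time

        from bs4 import BeautifulSoup
        from selenium import webdriver


        def get_article_links(url, limit_loading):
            options = webdriver.ChromeOptions()
            options.add_argument("--window-size=1920,1080")
            options.add_argument("--disable-extensions")
            options.add_argument("--disable-notifications")
            options.add_argument("--disable-popup-blocking")

            driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe",
                                      options=options)  # add your chromedriver path
            driver.get(url)

            # Scroll the load-more anchor into view once per batch of articles.
            loading = 0
            end_div = driver.find_element('class name', 'topic-content__load-more-anchor')
            while loading < limit_loading:
                loading += 1
                print(f'scrolling to page {loading}...')
                end_div.location_once_scrolled_into_view
                time.sleep(2)

            # Collect the link behind every article title loaded so far.
            article_links = []
            bsObj = BeautifulSoup(driver.page_source, 'html.parser')
            container = bsObj.find('div', {'class': 'content-box'}).find('div', {'class': 'topic-article-container'})
            for h2 in container.find_all('h2', {'class': 'article__title'}):
                article_links.append(h2.a['href'])

            driver.quit()
            return article_links
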
    If we now call the function with different values of limit_loading, we get a different count of unique news links. Here are a couple of runs:

    >>> ar_links = get_article_links('https://www.scmp.com/topics/trade', 2)
    scrolling to page 1...
    scrolling to page 2...
    >>> len(ar_links)
    90
    >>> ar_links = get_article_links('https://www.scmp.com/topics/trade', 3)
    scrolling to page 1...
    scrolling to page 2...
    scrolling to page 3...
    >>> len(ar_links)
    120
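
  • As a side note, the fixed time.sleep(2) assumes every batch loads within two seconds. A more defensive variant (my sketch, not part of the tested answer above) waits until the count of article titles actually grows, and reports when nothing new loads so the caller can stop early:

        from selenium.common.exceptions import TimeoutException
        from selenium.webdriver.support.ui import WebDriverWait


        def scroll_once(driver, end_div, timeout=15):
            """Scroll to the anchor div and wait for more articles to appear.

            Returns False if no new article titles load within `timeout` seconds.
            """
            before = len(driver.find_elements('class name', 'article__title'))
            end_div.location_once_scrolled_into_view
            try:
                WebDriverWait(driver, timeout).until(
                    lambda d: len(d.find_elements('class name', 'article__title')) > before)
                return True
            except TimeoutException:
                return False

    The while loop above could then call scroll_once(driver, end_div) and break as soon as it returns False, instead of sleeping for a fixed interval.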