python selenium selenium-webdriver beautifulsoup web-crawler

Not scrolling down in a website having dynamic scroll

I'm scraping news articles from a website whose category pages have no load-more button; the article links are generated as I scroll down. I wrote a function that takes a category page URL (url) and limit_loading (how many times I want to scroll down) and returns all the article links displayed on that page.

Category page link = https://www.scmp.com/topics/trade

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service


def get_article_links(url, limit_loading):
    # Configure Chrome: fixed window size, no extensions or notification prompts.
    options = webdriver.ChromeOptions()
    options.add_argument("--window-size=1920,1080")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-notifications")
    options.add_argument("--disable-Advertisement")
    options.add_argument("--disable-popup-blocking")

    # Selenium 4 style: the chromedriver path goes through a Service object
    # (older versions took it as executable_path=...).
    driver = webdriver.Chrome(service=Service(r"E:\chromedriver\chromedriver.exe"), options=options)  # set your chromedriver path

    driver.get(url)
    last_height = driver.execute_script("return document.body.scrollHeight")
    
    loading = 0
    while loading < limit_loading:
        loading += 1
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(8)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
        
        
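    # Parse the rendered page and collect the link inside each article title.
    article_links = []
    bsObj = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()  # close the browser once the HTML has been captured
    container = bsObj.find('div', {'class': 'content-box'}).find('div', {'class': 'topic-article-container'})
    for title in container.find_all('h2', {'class': 'article__title'}):
        article_links.append(title.a['href'])

    return article_links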

Assuming I want to scroll 5 times on this category page:

get_article_links('https://www.scmp.com/topics/trade', 5)

But even if I change limit_loading, it returns only the links from the first page. I must have made a mistake in the scrolling part. Please help me with this.

Solution

Instead of scrolling by the body's scrollHeight property, I checked whether there was a suitable element after the list of articles to scroll to. I noticed this appropriately named div:

<div class="topic-content__load-more-anchor" data-v-db98a5c0=""></div>

Accordingly, the main change is to the while loop in your get_article_links function: find this div before the loop starts, then scroll to it on each iteration using location_once_scrolled_into_view, as follows:

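    loading = 0
    # Locate the anchor div once, before scrolling starts.
    end_div = driver.find_element('class name', 'topic-content__load-more-anchor')
    while loading < limit_loading:
        loading += 1
        print(f'scrolling to page {loading}...')
        # Reading this property scrolls the element into view as a side effect.
        end_div.location_once_scrolled_into_view
        time.sleep(2)  # wait for the next batch of articles to load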
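
A fixed number of scrolls keeps going even after the page has stopped loading new articles. Here is a minimal sketch of one way to stop early, assuming the article titles still match the `h2.article__title` selector used in the question's parsing code. The helper name `scroll_until_stable` and its `pause` parameter are hypothetical, and the pause length is a guess that may need tuning for this site:

```python
import time


def scroll_until_stable(driver, anchor, limit_loading, pause=2.0):
    """Scroll `anchor` into view up to `limit_loading` times.

    Stops early when a scroll adds no new article titles.
    Returns the number of scrolls performed.
    """
    # Assumption: titles match the selector from the question's parsing code.
    selector = "h2.article__title"
    seen = len(driver.find_elements("css selector", selector))
    scrolls = 0
    while scrolls < limit_loading:
        scrolls += 1
        # Standard DOM call; an explicit alternative to reading
        # location_once_scrolled_into_view for its side effect.
        driver.execute_script("arguments[0].scrollIntoView();", anchor)
        time.sleep(pause)  # give the page time to fetch the next batch
        count = len(driver.find_elements("css selector", selector))
        if count == seen:
            break  # nothing new loaded, so further scrolling is pointless
        seen = count
    return scrolls
```

Inside get_article_links this could stand in for the while loop, for example `scroll_until_stable(driver, end_div, limit_loading)`.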

If we now call the function with different limit_loading values, we get different counts of unique news links. Here are a couple of runs:

>>> ar_links = get_article_links('https://www.scmp.com/topics/trade', 2)
scrolling to page 1...
scrolling to page 2...
>>> len(ar_links)
90
>>> ar_links = get_article_links('https://www.scmp.com/topics/trade', 3)
scrolling to page 1...
scrolling to page 2...
scrolling to page 3...
>>> len(ar_links)
120
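
If the returned list can contain repeated or site-relative hrefs (an assumption; the raw values are not shown here), a small post-processing step can normalise it. This is only a sketch: the helper name `unique_links` is hypothetical, and the default base URL is taken from the category page link above.

```python
from urllib.parse import urljoin


def unique_links(hrefs, base_url="https://www.scmp.com"):
    """Return absolute URLs in first-seen order, without duplicates."""
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    absolute = (urljoin(base_url, href) for href in hrefs)
    # dict keys keep insertion order, so this de-duplicates stably.
    return list(dict.fromkeys(absolute))
```

It could be applied to the function's result, for example `unique_links(get_article_links('https://www.scmp.com/topics/trade', 3))`.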