python · web-scraping · playwright · playwright-python

Playwright Python gives inconsistent scraping results


My Playwright Python script goes to the LeetCode problems URL, selects "Top 100 Liked Questions", scrapes all the problem names, and clicks "Next page" until the "Next page" button is disabled, printing the problem names as it goes.

The problem is that sometimes on page 2 I get page 1's problem names, sometimes I get empty results, and sometimes I get all of the correct results (when I set slow_mo=1000).

Here's the code:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    count = 1 #Page Count
    browser = p.chromium.launch(slow_mo=1000)
    page = browser.new_page()
    page.goto("https://leetcode.com/problemset/all/")
    page.get_by_role("button", name="Lists").click()
    page.get_by_text("Top 100 Liked Questions").click()
    while True:
        print(f'\nPage {count}\n')
        count += 1 #Increment Page Count

        #Current Page Operations
        nodes = page.query_selector_all("a.h-5")
        for node in nodes:
            print(node.inner_text())

        #Next Page
        if page.get_by_label("next", exact=True).is_enabled():
            page.get_by_label("next", exact=True).click()
        else:
            break
    
    browser.close()

When you go to the URL manually and select "Top 100 Liked Questions", there are only 2 pages, which together contain the 100 top liked problems.

My script sometimes returns empty results and sometimes repeats page 1's problems on page 2; the results are very inconsistent. When I add slow_mo=1000 the errors are less frequent, but they still appear.

How do I ensure accurate results 100% of the time?


Solution

  • The issue is that each click triggers a navigation, but instead of waiting for the navigation to resolve, your while loop immediately starts the next iteration and begins pulling down the data regardless of whether it's been updated or not. This can cause results to appear twice and/or not appear at all if the next chunk loads too slowly.

    Using locators over query selectors is preferred, but not enough to determine when the page has changed. You could wait for the URL to change, or wait until your current/last page's worth of results changes, since it's a pretty safe assumption that each page's results are unique.

    from playwright.sync_api import sync_playwright # 1.37.0
    
    
    with sync_playwright() as p:
        def scrape_problems():
            loc = page.locator('[role="rowgroup"] [href^="/problems/"]')
            loc.first.wait_for()
            problems = [x for x in loc.all_text_contents() if x]
    
            if len(problems) == 51:
                problems.pop(0) # get rid of the "daily" problem
    
            return problems
    
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://leetcode.com/problemset/all/")
        problems = []
        chunk = scrape_problems()
        page.get_by_role("button", name="Lists").click()
        page.get_by_text("Top 100 Liked Questions").click()
    
        while True:
            # caution: no timeout/retry limit on this loop
            while (next_chunk := scrape_problems()) == chunk:
                pass
    
            problems.extend(chunk := next_chunk)
    
            if page.get_by_label("next", exact=True).is_enabled():
                page.get_by_label("next", exact=True).click()
            else:
                break
    
        for problem in problems:  # avoid shadowing the sync_playwright `p`
            print(problem)
    
        print("count:", len(problems))
        browser.close()
    

    Note that "Top 100 Liked Questions" only has 100 results, so you could click on the "100 / page" dropdown and skip pagination entirely.
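    As the caution comment in the code notes, the change-detection loop has no timeout or retry limit, so it will spin forever if the results never change. One way to harden it is to wrap the polling in a small helper with a deadline. This is just a sketch: the `wait_for_new_results` name and its parameters are my own, not part of Playwright.

```python
import time

def wait_for_new_results(get_results, old, timeout=10.0, poll=0.25):
    """Poll get_results() until it returns something different from `old`.

    get_results: a zero-argument callable, e.g. lambda: scrape_problems().
    Raises TimeoutError if nothing changes within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while (new := get_results()) == old:
        if time.monotonic() > deadline:
            raise TimeoutError("results did not change before the deadline")
        time.sleep(poll)  # back off instead of hammering the page with queries
    return new
```

    With this helper, the bare `while (next_chunk := scrape_problems()) == chunk: pass` spin could become `next_chunk = wait_for_new_results(scrape_problems, chunk)`, which fails loudly instead of hanging when the page never updates.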