Tags: python, selenium-webdriver, beautifulsoup, python-requests, playwright

How can I scrape data from a website into a CSV file using Python Playwright (or alternatives) while avoiding access errors and improving speed?


I'm trying to scrape data from this website using Python and Playwright, but I'm encountering a few issues. The browser runs in non-headless mode, and the process is very slow. When I tried other approaches, like using requests and BeautifulSoup, I ran into access issues, including 403 Forbidden and 404 Not Found errors. My goal is to scrape all pages efficiently and save the data into a CSV file.

Here’s the code I’m currently using:

import asyncio
from playwright.async_api import async_playwright
import pandas as pd
from io import StringIO

URL = "https://www.coingecko.com/en/coins/1/markets/spot"

async def fetch_page(page, url):
    print(f"Fetching: {url}")
    await page.goto(url)
    await asyncio.sleep(5)
    return await page.content()

async def scrape_all_pages(url, max_pages=10):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False, slow_mo=2000)
        context = await browser.new_context(viewport={"width": 1280, "height": 900})
        page = await context.new_page()

        markets = []
        for page_num in range(1, max_pages + 1):
            html = await fetch_page(page, f"{url}?page={page_num}")
            dfs = pd.read_html(StringIO(html))  # Parse tables
            markets.extend(dfs)

        await page.close()
        await context.close()
        await browser.close()

    return pd.concat(markets, ignore_index=True)

def run_async(coro):
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        loop = None

    if loop and loop.is_running():
        return asyncio.create_task(coro)
    else:
        return asyncio.run(coro)

async def main():
    max_pages = 10
    df = await scrape_all_pages(URL, max_pages)
    df = df.dropna(how='all')
    print(df)

run_async(main())

The primary issues are the slow scraping speed and the access errors I hit when using alternatives to Playwright. I'm looking for advice on how to improve this approach, whether by optimizing the current code, handling the access restrictions (for example with user-agent spoofing or proxies), or switching to a different library entirely. Any suggestions on how to make the process faster and more reliable would be greatly appreciated. Thank you.


Solution

  • Before even writing your scraping code, always take the time to understand the webpage.

    In this case, viewing the page source and the network tab in dev tools shows that nothing on the page is loaded dynamically: the tables are already present in the initial HTML response.

    My first instinct was to use simple HTTP requests with a user agent, but these get blocked by the server, so Playwright is a reasonable option.
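
    For reference, a bare-bones requests attempt with a browser user agent, sketched below with the URL from the question, is exactly the kind of call that gets rejected:

    import requests

    url = "https://www.coingecko.com/en/coins/1/markets/spot"
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/130.0.0.0 Mobile Safari/537.3"
        )
    }

    resp = requests.get(url, headers=headers, timeout=30)
    print(resp.status_code)  # blocked -- 403 Forbidden per the question, despite the UA header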

    But since the data is static, there's no need to sleep. You can also go headless (with a user agent), disable JS, and use the least strict navigation predicate, "commit", which resolves as soon as the navigation response starts arriving.

    Here's an initial rewrite:

    import asyncio
    import pandas as pd  # 2.2.2
    from io import StringIO
    from playwright.async_api import async_playwright  # 1.48.0
    
    
    URL = "https://www.coingecko.com/en/coins/1/markets/spot"  # the URL from the question
    UA = (
        "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/130.0.0.0 Mobile Safari/537.3"
    )
    
    
    async def scrape_all_pages(base_url, max_pages=10):
        markets = []
    
        async with async_playwright() as p:
            # Headless launch (the default) with a real browser UA; JavaScript can
            # stay disabled because the tables are server-rendered.
            browser = await p.chromium.launch()
            context = await browser.new_context(user_agent=UA, java_script_enabled=False)
            page = await context.new_page()
    
            for page_num in range(1, max_pages + 1):
                url = f"{base_url}?page={page_num}"
                # "commit" resolves as soon as the navigation response starts
                # arriving, so there's no waiting for load or network idle.
                await page.goto(url, wait_until="commit")
                html = await page.content()
                markets.extend(pd.read_html(StringIO(html)))  # parse every <table> on the page
    
        return pd.concat(markets, ignore_index=True)
    
    
    async def main():
        df = await scrape_all_pages(URL, max_pages=10)
        df = df.dropna(how="all")
        print(df)
    
    
    asyncio.run(main())
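
    Note that this still doesn't write the CSV file you asked for; adding something like df.to_csv("markets.csv", index=False) after the dropna() call takes care of that (the filename is just a placeholder).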
    

    Original time:

    real 1m17.661s
    user 0m3.348s
    sys  0m1.190s
    

    Rewrite time:

    real 0m6.912s
    user 0m1.417s
    sys  0m0.785s
    

    That's roughly an 11x speedup (about 78 s down to about 7 s). The biggest improvement came from removing the unnecessary sleeps and the slow_mo delay.

    If you're scraping many more pages than this, adding a task queue to increase parallelism can help. For only 10 pages, though, the overhead and extra code complexity usually aren't worth it; a rough sketch of a concurrent version follows in case you do need it.
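
    This is a minimal, untested sketch that reuses the UA constant from the rewrite above; the concurrency cap of 3 is an arbitrary, polite value, not something measured against this site:

    import asyncio
    import pandas as pd
    from io import StringIO
    from playwright.async_api import async_playwright

    MAX_CONCURRENCY = 3  # arbitrary cap; keep it small to stay polite to the server


    async def fetch_tables(context, url, sem):
        # One page per task so concurrent navigations don't interfere.
        async with sem:
            page = await context.new_page()
            try:
                await page.goto(url, wait_until="commit")
                html = await page.content()
            finally:
                await page.close()
        return pd.read_html(StringIO(html))


    async def scrape_all_pages_concurrent(base_url, max_pages=10):
        sem = asyncio.Semaphore(MAX_CONCURRENCY)
        async with async_playwright() as p:
            browser = await p.chromium.launch()
            context = await browser.new_context(user_agent=UA, java_script_enabled=False)
            tasks = [
                fetch_tables(context, f"{base_url}?page={n}", sem)
                for n in range(1, max_pages + 1)
            ]
            # gather() returns results in task order, so pages stay in sequence.
            per_page_tables = await asyncio.gather(*tasks)
        return pd.concat(
            [df for tables in per_page_tables for df in tables], ignore_index=True
        )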

    Also, you may have added the sleeps to avoid server rate limiting. That's a good reason to sleep, but then you won't get the same speedup. Using a residential proxy cluster would bypass the rate limit, but it's a lot of hassle to set up and out of scope for this question.
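
    For completeness, if you do go the proxy route, Playwright accepts proxy settings at launch; here's a minimal sketch with a placeholder server address and credentials, placed inside the same async with async_playwright() as p: block as above:

    browser = await p.chromium.launch(
        proxy={
            "server": "http://proxy.example.com:3128",  # placeholder endpoint
            "username": "user",                         # placeholder credentials
            "password": "secret",
        }
    )
    # ...then create the context/page and scrape exactly as in the rewrite above.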