Tags: python, web-scraping, google-cloud-platform, playwright, google-cloud-run

Playwright works locally, but fails once deployed to Google Cloud Run


I developed and successfully tested a Python script locally that uses Playwright for web scraping.

TL;DR: the script logs into a website, navigates to a different page, clicks on a few links and then downloads a CSV file. The data will ultimately be pushed into a Google BigQuery table, but for the moment it is simply returned to the browser for testing purposes.

I created a Dockerfile and pushed the container image to Google Artifact Registry. Finally, I deployed the service to Google Cloud Run.
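The question does not include the actual Dockerfile, but for reference, a sketch of one that bundles the Playwright browsers inside the image; a missing browser install is one of the most common causes of "works locally, fails on Cloud Run". The base image tag here is an assumption chosen to match `playwright==1.39.0` from requirements.txt:

```shell
# Sketch only -- the original Dockerfile is not shown in the question.
# The official Playwright Python image ships Chromium plus its system
# dependencies, so no separate "playwright install" step is needed.
FROM mcr.microsoft.com/playwright/python:v1.39.0-jammy

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY main.py .

# Cloud Run routes requests to the port the app listens on (8080 here)
CMD ["python", "main.py"]
```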

Here is main.py:

from flask import Flask, request, abort, Response
from playwright.async_api import async_playwright
import asyncio
import pandas as pd
import io

app = Flask(__name__)

# Declare the username and password directly in the code
USERNAME = ""
PASSWORD = ""
DASHBOARD = ""

@app.route('/')
def home():
    return 'Flask is running! Visit /csvdata?url=<your-url> to capture and view the CSV data.'

@app.route('/csvdata')
def capture_and_display_csv():
    # Get the URL from the query string
    url = request.args.get('url')
    
    if not url:
        return abort(400, description="URL parameter is required")
    
    # Run the function to get the CSV data
    csv_data = asyncio.run(get_csv_data(url))
    
    # Convert DataFrame to HTML for display (or just return as plain text)
    csv_html = csv_data.to_html()  # Or use csv_data.to_string() for plain text
    
    return Response(csv_html, mimetype='text/html')

async def get_csv_data(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        page.set_default_timeout(600000)
        
        await page.goto(url, timeout=600000)
        
        # Enter the username and password
        await page.fill('#username', USERNAME)
        await page.fill('#password input', PASSWORD)
        
        # Click the login button
        await page.click('button[type="button"]')
        await page.wait_for_selector('div.randomDOM', timeout=600000)
        await page.goto(DASHBOARD, timeout=600000)

        await page.click('a.step1')
        await page.click('a.step2')
        await page.click('a.step3')
        # Register the download listener before clicking the link that
        # triggers it, so the download event cannot be missed
        async with page.expect_download(timeout=600000) as download_info:
            await page.click('a.download_link')
        download = await download_info.value
        
        # path() waits for the download to complete and returns the temp file
        csv_path = await download.path()
        
        # Read the CSV content into a pandas DataFrame
        df = pd.read_csv(csv_path)
        await browser.close()
        return df

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=8080)

Here is requirements.txt:

Flask==2.3.2
playwright==1.39.0
pandas==2.0.3
numpy==1.25.2

As per the question title, the problem is that once deployed to Google Cloud Run the script times out (in quite a strange fashion: it seems to time out initially, then makes a second attempt that also times out). The logs don't reveal anything that actually fails; in fact, as shown below, they suggest that the run completed successfully with a 200 document response:

(screenshot: Cloud Run logs showing a 200 response)

I have cranked up the resources on the Cloud Run service to 8 GiB of memory, 8 CPUs and a 3600-second request timeout (the maximum), and, as per the code above, explicitly set the page and wait timeouts to 10 minutes; none of this has yielded a result. Hoping someone knows how to make this work or has any ideas.


Solution

  • The solution here is to use the Google Cloud Run Jobs service and redeploy the app as a standalone script rather than a request-driven app. This has two positive impacts:

    1. No need for Flask
    2. Longer and more persistent execution of the script, and hence fewer timeout issues
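A minimal sketch of how the same scraper might look as a Cloud Run Job entry point. Two assumptions not in the original post: credentials are read from environment variables instead of being hard-coded, and the BigQuery push is left as a placeholder comment. The third-party imports are deferred into the job function so the module itself stays importable:

```python
# Sketch of the scraper repackaged as a standalone Cloud Run Job script.
# Assumptions: config comes from environment variables; the BigQuery
# push (the post's stated end goal) is only a placeholder comment here.
import asyncio
import os
import sys


def get_config():
    """Read job configuration from environment variables."""
    return {
        "url": os.environ["LOGIN_URL"],
        "username": os.environ["SCRAPER_USERNAME"],
        "password": os.environ["SCRAPER_PASSWORD"],
        "dashboard": os.environ["DASHBOARD_URL"],
    }


async def run_job(config):
    # Imported here so the module can be loaded without Playwright installed
    from playwright.async_api import async_playwright
    import pandas as pd

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        page.set_default_timeout(600_000)

        await page.goto(config["url"])
        await page.fill('#username', config["username"])
        await page.fill('#password input', config["password"])
        await page.click('button[type="button"]')
        await page.wait_for_selector('div.randomDOM')
        await page.goto(config["dashboard"])

        await page.click('a.step1')
        await page.click('a.step2')
        await page.click('a.step3')

        # Register the download listener before the click that triggers it
        async with page.expect_download() as download_info:
            await page.click('a.download_link')
        download = await download_info.value

        df = pd.read_csv(await download.path())
        await browser.close()

    # Placeholder: push df into BigQuery here instead of returning HTML
    print(f"Scraped {len(df)} rows")


if __name__ == "__main__":
    try:
        asyncio.run(run_job(get_config()))
    except Exception as exc:
        print(f"Job failed: {exc}", file=sys.stderr)
        sys.exit(1)  # a non-zero exit marks the job execution as failed
```

The job can then be deployed with `gcloud run jobs deploy`, which accepts a `--task-timeout` well beyond the 3600-second ceiling of a request-driven service.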