I developed and successfully tested a Python script locally that uses Playwright for web scraping.
TL;DR: the script logs into a website, navigates to a different page, clicks a few links, and downloads a CSV file. The data will ultimately be pushed into a Google BigQuery table, but for the moment it is simply returned to the browser for testing purposes.
I created a Dockerfile, built the image, and pushed it to Google Artifact Registry. Finally, I deployed the service to Cloud Run.
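For reference, a minimal Dockerfile for this setup can look like the sketch below (the base image tag is an assumption; the official Playwright Python image bundles Chromium and its system dependencies, so no separate browser install step is needed):

FROM mcr.microsoft.com/playwright/python:v1.39.0-jammy
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
# Cloud Run routes traffic to port 8080, matching app.run() in main.py
CMD ["python", "main.py"]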
Here is main.py:
from flask import Flask, request, abort, Response
from playwright.async_api import async_playwright
import asyncio
import pandas as pd

app = Flask(__name__)

# Declare the username and password directly in the code
USERNAME = ""
PASSWORD = ""
DASHBOARD = ""

@app.route('/')
def home():
    return 'Flask is running! Visit /csvdata?url=<your-url> to capture and view the CSV data.'

@app.route('/csvdata')
def capture_and_display_csv():
    # Get the URL from the query string
    url = request.args.get('url')
    if not url:
        abort(400, description="URL parameter is required")

    # Run the async scraper to completion and get the CSV data as a DataFrame
    csv_data = asyncio.run(get_csv_data(url))

    # Convert the DataFrame to HTML for display (or use csv_data.to_string() for plain text)
    csv_html = csv_data.to_html()
    return Response(csv_html, mimetype='text/html')

async def get_csv_data(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        page.set_default_timeout(600000)
        await page.goto(url, timeout=600000)

        # Enter the username and password
        await page.fill('#username', USERNAME)
        await page.fill('#password input', PASSWORD)

        # Click the login button and wait for a post-login element to confirm the session
        await page.click('button[type="button"]')
        await page.wait_for_selector('div.randomDOM', timeout=600000)

        # Navigate to the dashboard and click through to the download
        await page.goto(DASHBOARD, timeout=600000)
        await page.click('a.step1')
        await page.click('a.step2')
        await page.click('a.step3')

        # Register the download listener before clicking the link; otherwise
        # the download event can fire before it is being waited on
        async with page.expect_download(timeout=600000) as download_info:
            await page.click('a.download_link')
        download = await download_info.value

        # Read the downloaded CSV into a pandas DataFrame
        csv_path = await download.path()
        df = pd.read_csv(csv_path)

        await browser.close()
        return df

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=8080)
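One Cloud Run detail worth noting: the platform injects the serving port via the PORT environment variable (8080 by default), so a slightly more defensive entrypoint reads it rather than hardcoding the port. A minimal sketch:

if __name__ == "__main__":
    import os
    # Cloud Run sets PORT; fall back to 8080 for local runs
    app.run(host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))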
Here is requirements.txt:
Flask==2.3.2
playwright==1.39.0
pandas==2.0.3
numpy==1.25.2
As per the question title, the problem is that once deployed on Cloud Run the script times out (in quite a strange fashion: it seems to time out initially, then makes a second attempt that also times out). The logs don't reveal anything that actually fails; in fact, they suggest that the run completed successfully with a 200 response.
I have cranked up the resources on the Cloud Run service to 8 GiB of memory, 8 CPUs, and the maximum request timeout of 3600 seconds (the equivalent deploy flags are sketched below), and, as per the code above, explicitly set the page and all wait timeouts to 10 minutes. None of these changes has yielded a result. Hoping someone knows how to make this work or has any ideas.
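For reference, those limits correspond to deploy flags along these lines (a sketch; the service name and image path are placeholders):

gcloud run deploy my-scraper \
  --image REGION-docker.pkg.dev/PROJECT/REPO/IMAGE \
  --memory 8Gi \
  --cpu 8 \
  --timeout 3600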
The solution here is to utilise the Cloud Run Jobs service and redeploy the app as a standalone job rather than a GET-request-driven service. This has two positive impacts on performance: