I would like to use the Playwright library to web scrape inside Google Cloud Functions.
I am a beginner with both GCP and Python, so I am looking not just for a solution but also to learn best practices.
Assume the very basic scenario of using Playwright to browse to https://www.google.com, take a screenshot and return it to the browser in response to a GET request.
So far, I have tried the following, without success:
main.py
import asyncio
from playwright.async_api import async_playwright
from google.cloud import functions_framework
from flask import send_file
import io

@functions_framework.http
def capture_screenshot(request):
    screenshot = asyncio.run(take_screenshot())
    return send_file(
        io.BytesIO(screenshot),
        mimetype='image/png',
        as_attachment=False,
        attachment_filename='screenshot.png'
    )

async def take_screenshot():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://www.google.com')
        screenshot = await page.screenshot(full_page=True)
        await browser.close()
        return screenshot
requirements.txt
playwright==1.34.0
Flask==2.3.3
google-cloud-functions==1.0.0
The code deploys, but when it runs, I get the below error returned to the browser:
500 Internal Server Error: The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
Finally when looking through the logs, this is what I get:
[2024-08-11 03:41:08,376] ERROR in app: Exception on / [GET]
Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/functions_framework/__init__.py", line 99, in view_func
    return function(request._get_current_object())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/functions_framework/__init__.py", line 80, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/main.py", line 9, in capture_screenshot
    screenshot = asyncio.run(take_screenshot())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.runtime/python/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/layers/google.python.runtime/python/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.runtime/python/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/workspace/main.py", line 19, in take_screenshot
    browser = await p.chromium.launch()
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/playwright/async_api/_generated.py", line 14655, in launch
    await self._impl_obj.launch(
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/playwright/_impl/_browser_type.py", line 95, in launch
    Browser, from_channel(await self._channel.send("launch", params))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 61, in send
    return await self._connection.wrap_api_call(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 482, in wrap_api_call
    return await cb()
           ^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 97, in inner_send
    result = next(iter(done)).result()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
playwright._impl._api_types.Error: Executable doesn't exist at /www-data-home/.cache/ms-playwright/chromium-1064/chrome-linux/chrome
╔════════════════════════════════════════════════════════════╗
║ Looks like Playwright was just installed or updated.       ║
║ Please run the following command to download new browsers: ║
║                                                            ║
║     playwright install                                     ║
║                                                            ║
║ <3 Playwright Team                                         ║
╚════════════════════════════════════════════════════════════╝
From other SO posts and other bits that I have picked up, the issue seems to be the missing Chromium executable. One suggestion was to build a Docker container and deploy it on Cloud Run instead, but I am not sure how that works. Any suggestions or resources would be greatly appreciated.
This is most likely because Playwright expects a specific Chromium build, and pip does not download browsers along with the package, so the executable is missing from your environment. One easy workaround I can think of is to add a subprocess call that installs Chromium through the Playwright CLI when the function is called:
import subprocess
subprocess.run(["playwright", "install", "chromium"])
So the full code would be:
import asyncio
import io
import subprocess

import functions_framework  # provided by the functions-framework package
from flask import send_file
from playwright.async_api import async_playwright


@functions_framework.http
def capture_screenshot(request):
    # Download the Chromium build that matches the installed Playwright version.
    subprocess.run(["playwright", "install", "chromium"])
    screenshot = asyncio.run(take_screenshot())
    return send_file(
        io.BytesIO(screenshot),
        mimetype='image/png',
        as_attachment=False,
        # Flask 2.2 removed attachment_filename in favor of download_name.
        download_name='screenshot.png'
    )


async def take_screenshot():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://www.google.com')
        screenshot = await page.screenshot(full_page=True)
        await browser.close()
        return screenshot
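Calling the installer on every request repeats work once the browser is already in place. A minimal sketch of one way to limit that, assuming the browser cache directory is writable and survives between requests served by the same instance, is to guard the install with a module-level flag. The names `ensure_chromium` and `_browser_ready` are hypothetical, and the `run` parameter exists only so the installer can be swapped out when testing.

```python
import subprocess
import sys

# Per-instance flag (hypothetical name); it starts out False on every cold start.
_browser_ready = False


def ensure_chromium(run=subprocess.run):
    """Install Chromium via the Playwright CLI at most once per instance.

    Returns True if the installer was invoked, False if it was skipped.
    """
    global _browser_ready
    if _browser_ready:
        return False  # already installed by an earlier request on this instance
    # `python -m playwright` avoids depending on the `playwright` script being on PATH.
    # check=True raises CalledProcessError if the download fails.
    run([sys.executable, "-m", "playwright", "install", "chromium"], check=True)
    _browser_ready = True
    return True
```

With this in place, the handler would call `ensure_chromium()` instead of `subprocess.run(...)` directly. The first request on a new instance still pays the download cost, so this only reduces the overhead rather than removing it.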
Alternatively, as you mentioned, you can skip Google Cloud Functions and instead package the code as a Dockerized Python web application that you build and deploy to Cloud Run. The advantage of this approach is that the Dockerfile lets you run additional configuration and installation commands at build time, which is not possible through Google Cloud Functions. In this case, you can add the lines
RUN pip install -r requirements.txt
RUN playwright install chromium
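to the Dockerfile, so that building the image installs all the Python dependencies along with the Chromium version that matches the installed Playwright release. Depending on the base image, Chromium's system libraries may also be missing; `playwright install --with-deps chromium` installs them on supported distributions.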
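For the Cloud Run route, the application itself becomes an ordinary web server inside the container. The following is only a rough sketch under stated assumptions: it uses Flask's built-in server for brevity (a production image would typically run a WSGI server instead), it switches to Playwright's sync API to avoid `asyncio.run`, and the names `listen_port`, `create_app` and `main` are hypothetical. Cloud Run passes the port to listen on through the `PORT` environment variable, which defaults to 8080.

```python
import os


def listen_port(environ=os.environ):
    """Return the port the container should listen on (PORT, default 8080)."""
    return int(environ.get("PORT", "8080"))


def take_screenshot(url):
    """Render `url` in headless Chromium and return the PNG bytes."""
    # Imported lazily so the module still loads where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        png = page.screenshot(full_page=True)
        browser.close()
    return png


def create_app():
    """Build the Flask app that serves the screenshot on GET /."""
    from flask import Flask, Response

    app = Flask(__name__)

    @app.get("/")
    def capture_screenshot():
        png = take_screenshot("https://www.google.com")
        return Response(png, mimetype="image/png")

    return app


def main():
    # Bind to all interfaces so Cloud Run can route traffic into the container.
    create_app().run(host="0.0.0.0", port=listen_port())
```

The container's start command would call `main()`. Because the browser is downloaded while the image is built, no install step runs at request time.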