Web Scraping in Python from Google Cloud Functions


I would like to use the Playwright library to scrape web pages inside Google Cloud Functions.

I am quite a beginner with GCP and Python in general, so I am looking not just for a solution but also to learn best practices.

Assume the very basic scenario of using Playwright to browse to https://www.google.com, take a screenshot, and return it to the browser in response to a GET request.

So far, I have tried the following, without success:

main.py

import asyncio
from playwright.async_api import async_playwright
from google.cloud import functions_framework
from flask import send_file
import io

@functions_framework.http
def capture_screenshot(request):
    screenshot = asyncio.run(take_screenshot())
    return send_file(
        io.BytesIO(screenshot),
        mimetype='image/png',
        as_attachment=False,
        attachment_filename='screenshot.png'
    )

async def take_screenshot():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://www.google.com')
        screenshot = await page.screenshot(full_page=True)
        await browser.close()
        return screenshot

requirements.txt

playwright==1.34.0
Flask==2.3.3
google-cloud-functions==1.0.0

The code deploys, but when it runs, I get the below error returned to the browser:

500 Internal Server Error: The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.

Finally, when looking through the logs, this is what I get:

[2024-08-11 03:41:08,376] ERROR in app: Exception on / [GET]
Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/functions_framework/__init__.py", line 99, in view_func
    return function(request._get_current_object())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/functions_framework/__init__.py", line 80, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/main.py", line 9, in capture_screenshot
    screenshot = asyncio.run(take_screenshot())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.runtime/python/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/layers/google.python.runtime/python/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.runtime/python/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/workspace/main.py", line 19, in take_screenshot
    browser = await p.chromium.launch()
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/playwright/async_api/_generated.py", line 14655, in launch
    await self._impl_obj.launch(
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/playwright/_impl/_browser_type.py", line 95, in launch
    Browser, from_channel(await self._channel.send("launch", params))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 61, in send
    return await self._connection.wrap_api_call(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 482, in wrap_api_call
    return await cb()
           ^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 97, in inner_send
    result = next(iter(done)).result()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
playwright._impl._api_types.Error: Executable doesn't exist at /www-data-home/.cache/ms-playwright/chromium-1064/chrome-linux/chrome
╔════════════════════════════════════════════════════════════╗
║ Looks like Playwright was just installed or updated.       ║
║ Please run the following command to download new browsers: ║
║                                                            ║
║     playwright install                                     ║
║                                                            ║
║ <3 Playwright Team                                         ║
╚════════════════════════════════════════════════════════════╝

From other SO posts and other bits that I have picked up, the issue seems to be the Chromium driver. There was a suggestion to build a Docker container and deploy it on Google Cloud Run, but I am not sure how this works. Any suggestions or resources would be greatly appreciated.


Solution

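  • This is most likely because Playwright expects a specific Chromium build that does not exist in your environment. Installing the playwright package with pip does not download the browser binaries, which is why the error asks you to run playwright install. One easy workaround I can think of is to add a subprocess call that installs Chromium through the Playwright CLI when the function is called: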

    import subprocess
    subprocess.run(["playwright", "install", "chromium"])
    

    So the code would be:

    import asyncio
    import io
    import subprocess
    
    # The functions-framework package exposes a top-level module
    # (the traceback shows it is already present in the runtime).
    import functions_framework
    from flask import send_file
    from playwright.async_api import async_playwright
    
    @functions_framework.http
    def capture_screenshot(request):
        # Download the Chromium build that matches the installed Playwright.
        # check=True raises if the download fails instead of continuing silently.
        subprocess.run(["playwright", "install", "chromium"], check=True)
        screenshot = asyncio.run(take_screenshot())
        return send_file(
            io.BytesIO(screenshot),
            mimetype='image/png',
            as_attachment=False,
            # Flask 2.0 renamed attachment_filename to download_name.
            download_name='screenshot.png'
        )
    
    async def take_screenshot():
        async with async_playwright() as p:
            browser = await p.chromium.launch()
            page = await browser.new_page()
            await page.goto('https://www.google.com')
            screenshot = await page.screenshot(full_page=True)
            await browser.close()
            return screenshot
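
Running the install on every request repeats the download each time. A minimal sketch of one way to limit that, assuming the runtime keeps module-level state between requests served by the same warm instance and allows the download at all: guard the install with a flag. The names ensure_chromium, _browser_ready and the injectable run parameter are hypothetical and only there for illustration.

```python
import subprocess
import threading

_install_lock = threading.Lock()
_browser_ready = False  # assumption: survives between requests on a warm instance


def ensure_chromium(run=subprocess.run):
    """Trigger `playwright install chromium` at most once per process.

    `run` is a hypothetical injection point so the helper can be
    exercised without network access. Returns True if this call
    triggered the install, False if it had already been done.
    """
    global _browser_ready
    with _install_lock:
        if _browser_ready:
            return False
        # check=True raises CalledProcessError if the CLI exits non-zero.
        run(["playwright", "install", "chromium"], check=True)
        _browser_ready = True
        return True
```

Calling ensure_chromium() at the top of capture_screenshot in place of the bare subprocess.run line would keep the behaviour on the first request and skip the install on later ones. A cold start still pays the full download cost, so this only softens the workaround.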
    

    Alternatively, as you mentioned, you can avoid Google Cloud Functions and instead convert the function into a Dockerized Python web application that you build and deploy to Cloud Run. The advantage of this approach is that the Dockerfile can run additional configuration and installation commands, which is not possible through Google Cloud Functions. In this case, you can add the lines

    RUN pip install -r requirements.txt
    RUN playwright install chromium
    

    to the Dockerfile, so that building the image installs all the Python dependencies along with the Chromium version compatible with Playwright.
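
For the Cloud Run route, the web application itself could look roughly like the sketch below. It is an illustration under assumptions, not a tested deployment: it uses Flask and Playwright's sync API, and the names resolve_port, create_app, main and the injectable screenshot_fn parameter are hypothetical. It relies on Cloud Run passing the listening port in the PORT environment variable, with 8080 as the default.

```python
import os


def resolve_port(environ=os.environ):
    """Return the port to listen on; Cloud Run supplies it via PORT."""
    return int(environ.get("PORT", "8080"))


def take_screenshot(url):
    """Return a full-page PNG of `url` as bytes using the sync API."""
    # Imported lazily so the module still loads where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            return page.screenshot(full_page=True)
        finally:
            # Close the browser even if navigation or capture fails.
            browser.close()


def create_app(screenshot_fn=take_screenshot):
    """Build the Flask app; `screenshot_fn` is injectable for testing."""
    from flask import Flask, Response

    app = Flask(__name__)

    @app.get("/")
    def capture():
        png = screenshot_fn("https://www.google.com")
        return Response(png, mimetype="image/png")

    return app


def main():
    # Bind to all interfaces so the port is reachable from outside the container.
    create_app().run(host="0.0.0.0", port=resolve_port())
```

The container's start command would then call main(), or a production WSGI server could be pointed at create_app() instead. On slim base images Chromium may also need system libraries; the Playwright CLI has a --with-deps option (playwright install --with-deps chromium) for installing them.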