python selenium-webdriver selenium-chromedriver mwaa

Selenium Webdriver unexpectedly exits in AWS MWAA

I'm trying to run Selenium periodically within AWS MWAA, but chromedriver crashes with status code -5 every single time. I've searched for this status code without success. Any ideas as to what's causing this error? Alternatively, how should I be running Selenium with AWS MWAA? One suggestion I saw was to run Selenium in a Docker container alongside Airflow, but that isn't possible with AWS MWAA.

Code

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromiumService
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.core.os_manager import ChromeType
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(
    service=ChromiumService(
        ChromeDriverManager(chrome_type=ChromeType.CHROMIUM).install()
    ),
    options=options,
)

Error: chromedriver exits with status code -5

>>> options = Options()
>>> options.add_argument("--headless=new")
>>> driver = webdriver.Chrome(
...             service=ChromiumService(
...                 ChromeDriverManager(chrome_type=ChromeType.CHROMIUM).install()
...             ),
...             options=options,
...         )

DEBUG:selenium.webdriver.common.driver_finder:Skipping Selenium Manager; path to chrome driver specified in Service class: /usr/local/airflow/.wdm/drivers/chromedriver/linux64/114.0.5735.90/chromedriver
DEBUG:selenium.webdriver.common.service:Started executable: `/usr/local/airflow/.wdm/drivers/chromedriver/linux64/114.0.5735.90/chromedriver` in a child process with pid: 19414 using 0 to output -3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/selenium/webdriver/chrome/webdriver.py", line 45, in __init__
    super().__init__(
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/selenium/webdriver/chromium/webdriver.py", line 55, in __init__
    self.service.start()
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/selenium/webdriver/common/service.py", line 102, in start
    self.assert_process_still_running()
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/selenium/webdriver/common/service.py", line 115, in assert_process_still_running
    raise WebDriverException(f"Service {self._path} unexpectedly exited. Status code was: {return_code}")
selenium.common.exceptions.WebDriverException: Message: Service /usr/local/airflow/.wdm/drivers/chromedriver/linux64/114.0.5735.90/chromedriver unexpectedly exited. Status code was: -5

Versions

selenium==4.21.0

webdriver-manager==4.0.2

chromedriver==114.0.5735.90

aws-mwaa-local-runner v2.8.1

To reproduce this error, download aws-mwaa-local-runner v2.8.1, install the requirements above, open a shell in the container (docker exec -it {container_id} /bin/bash), and run the script.

Solution

Setup

I initially tried to make this work without root privileges because of a misunderstanding, so there are now two ways to set up the environment.

And yes, you need Chrome.
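
On the status code itself: Selenium reports the return code it gets from Python's subprocess module, where a negative value -N means the child process was terminated by signal N rather than exiting on its own. A minimal sketch for decoding it (describe_return_code is a hypothetical helper name, not part of Selenium):

```python
import signal


def describe_return_code(return_code: int) -> str:
    """Translate a subprocess return code into a readable description."""
    if return_code < 0:
        # On POSIX, subprocess reports "terminated by signal N" as -N
        signal_number = -return_code
        try:
            name = signal.Signals(signal_number).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by {name} (signal {signal_number})"
    return f"exited with status {return_code}"


if __name__ == "__main__":
    print(describe_return_code(-5))
```

On Linux, signal 5 is SIGTRAP. This only tells you how chromedriver died, not why, so the setup steps below are still needed.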

Setting up without root privileges

This method does not require root privileges. It is meant for the case where you cannot install packages in the environment, which is how I first understood the question. It also sounds like the approach the asker is leaning towards anyway.

I have provided a Python setup script (setup.py). Run it inside the environment and it will set everything up for you.

In short, it downloads Chrome, chromedriver, and the shared libraries they need to run (which I had previously installed using root privileges). It then extracts the archives, marks the binaries as executable, and adds the library directory to LD_LIBRARY_PATH so the binaries can find the libraries.

This is what it looks like:

import subprocess, zipfile, os


def unzip_file(name, path):
    """
    Unzips a file

    Args:
        name (str): The name of the zip file to unzip
        path (str): The path to the extract directory
    """
    print(f"Unzipping {name} to {path}...")

    # Open the ZIP file
    with zipfile.ZipFile(name, 'r') as zip_ref:
        # Extract all contents into the specified directory
        zip_ref.extractall(path)

    print("Extraction complete!")

    delete_file(name)


def download_file(url):
    """
    Downloads the file from a given url

    Args:
        url (str): The url to download the file from
    """
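    # check=True raises CalledProcessError if wget fails (e.g. a bad URL)
    download = subprocess.run(["wget", url], capture_output=True, text=True, check=True)

    # Print the output of the command (wget writes its progress to stderr)
    print(download.stdout or download.stderr)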


def delete_file(path):
    """
    Deletes a file if it exists

    Args:
        path (str): The path to the file to delete
    """
    # Check if the file exists before attempting to delete
    if os.path.exists(path):
        os.remove(path)
        print(f"File {path} has been deleted.")
    else:
        print(f"The file {path} does not exist.")


def write_to_bashrc(line):
    """
    Appends a line to ~/.bashrc if it is not already there

    Args:
        line (str): The line to write
    """
    # Path to the ~/.bashrc file
    bashrc_path = os.path.expanduser("~/.bashrc")

    # Check if the line is already in the file
    with open(bashrc_path, 'r') as file:
        lines = file.readlines()

    if line not in lines:
        with open(bashrc_path, 'a') as file:
            file.write(line)
        print(f"{line} has been added to ~/.bashrc")
    else:
        print("That is already in ~/.bashrc")


if __name__ == '__main__':
    download_file("https://storage.googleapis.com/chrome-for-testing-public/127.0.6533.119/linux64/chrome-linux64.zip")
    unzip_file("chrome-linux64.zip", ".")
    subprocess.run(["chmod", "+x", "chrome-linux64/chrome"], capture_output=True, text=True)

    download_file("http://tennessene.github.io/chrome-libs.zip")
    unzip_file("chrome-libs.zip", "libs")

    download_file("https://storage.googleapis.com/chrome-for-testing-public/127.0.6533.119/linux64/chromedriver-linux64.zip")
    unzip_file("chromedriver-linux64.zip", ".")
    subprocess.run(["chmod", "+x", "chromedriver-linux64/chromedriver"], capture_output=True, text=True)

    download_file("http://tennessene.github.io/driver-libs.zip")
    unzip_file("driver-libs.zip", "libs")

    current_directory = os.path.abspath(os.getcwd())

    library_line = f"export LD_LIBRARY_PATH={current_directory}/libs:$LD_LIBRARY_PATH\n"

    write_to_bashrc(library_line)

    # A Python script cannot change its parent shell's environment, so the
    # new LD_LIBRARY_PATH only takes effect in shells that load ~/.bashrc.
    print("Run 'source ~/.bashrc' (or open a new shell) to apply the change.")
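
One caveat with the ~/.bashrc approach: the export only applies to shells that load that file, which an Airflow task process may not do. A possible alternative, sketched here under that assumption, is to pass LD_LIBRARY_PATH straight to chromedriver through the env argument of Selenium's Service. The helper names library_env and make_driver are made up for illustration; the paths are the ones setup.py creates.

```python
import os


def library_env(libs_dir, base_env=None):
    """Return a copy of the environment with libs_dir prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    libs_dir = os.path.abspath(libs_dir)
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = f"{libs_dir}:{existing}" if existing else libs_dir
    return env


def make_driver(libs_dir="libs"):
    """Start headless Chrome with the bundled libraries on the library path."""
    # Imported lazily so library_env() can be used without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.binary_location = "chrome-linux64/chrome"

    # chromedriver is started with this environment; Chrome, which it
    # launches as a child process, is expected to inherit it
    service = Service("chromedriver-linux64/chromedriver", env=library_env(libs_dir))
    return webdriver.Chrome(options=options, service=service)
```

With this, main.py could call make_driver() instead of relying on the shell having sourced ~/.bashrc first.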

Setting up with root privileges

First, install Chrome. You can download the .rpm package directly from Google.

wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm

Then install it:

sudo rpm -i google-chrome-stable_current_x86_64.rpm

Next, download chromedriver. The builds are published as part of Chrome for Testing.

wget https://storage.googleapis.com/chrome-for-testing-public/127.0.6533.119/linux64/chromedriver-linux64.zip

Extract it

unzip chromedriver-linux64.zip
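
Note that the .rpm above installs whatever the current stable Chrome is, while the chromedriver URL pins 127.0.6533.119, and chromedriver generally expects a Chrome with the same major version. A rough sketch for comparing the two, assuming both binaries print a dotted version number when run with --version and that the .rpm puts Chrome on the PATH as google-chrome (the helper names are made up for illustration):

```python
import re
import subprocess


def major_version(version_output):
    """Extract the major version from text such as 'ChromeDriver 127.0.6533.119'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"No version number found in: {version_output!r}")
    return int(match.group(1))


def binary_major_version(path):
    """Run '<path> --version' and return the major version it reports."""
    result = subprocess.run([path, "--version"], capture_output=True, text=True, check=True)
    return major_version(result.stdout)


def versions_match(chrome_path="google-chrome",
                   driver_path="chromedriver-linux64/chromedriver"):
    """Return True when Chrome and chromedriver report the same major version."""
    return binary_major_version(chrome_path) == binary_major_version(driver_path)
```

If versions_match() returns False, download the chromedriver build whose version matches the installed Chrome instead.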

Here's a bit of background before the last step. As you probably already know, AWS MWAA runs on Amazon Linux, which is similar to CentOS/RHEL, while the dependency list I was working from is for Ubuntu. I found the right packages after stumbling across one of the libraries I needed packaged for Oracle Linux.

The packages go by different names there (e.g. nss instead of libnss3). I then checked Amazon's package repository and found them under names similar to Oracle Linux's. The libraries I ended up needing for chromedriver were nss, nss-utils, nspr, and libxcb.
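
A quicker way to find out which shared libraries a binary is missing is ldd, which prints "not found" next to every library it cannot resolve. A small sketch that wraps it (the function names are illustrative, and it assumes ldd is available in the container):

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "libnss3.so => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path):
    """Run ldd on a binary and return the libraries it could not resolve."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libraries(result.stdout)
```

For example, check_binary("chromedriver-linux64/chromedriver") returns the names to look up in the package repository, and an empty list means every dependency was resolved.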

Finally, install those pesky libraries

sudo dnf update
sudo dnf install nss nss-utils nspr libxcb

A lot better than copying them over by hand!

It should work right away after that. Make sure your main.py looks something like mine, though.

Running the script

Here is what my main Python script ended up looking like (main.py):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.wait import WebDriverWait


def visit_url(url):
    """
    Navigates to a given url.

    Args:
        url (str): The url of the site to visit (e.g., "https://stackexchange.com/").
    """
    print(f"Visiting {url}")
    driver.get(url)

    WebDriverWait(driver, 10).until(
        lambda driver: driver.execute_script('return document.readyState') == 'complete'
    )


if __name__ == '__main__':
    # Set up Chrome options
    options = Options()
    options.add_argument("--headless")  # Run Chrome in headless mode
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--remote-debugging-port=9222")
    options.binary_location = "chrome-linux64/chrome" # ONLY for non-root install

    # Initialize the WebDriver
    driver = webdriver.Chrome(options=options, service=Service("chromedriver-linux64/chromedriver"))

    try:
        visit_url("https://stackoverflow.com/")

        # For debugging purposes (if you can even access it)
        driver.save_screenshot("stack_overflow.png")

    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        # Always shut down the browser; quit() closes every window
        # and stops the chromedriver process
        print("Finished! Closing...")
        driver.quit()

Getting it to recognize Chrome was finicky for the non-root install, since the binary is not in its usual location. Still, this is a basic script you can base your program on. It saves a screenshot, and the --remote-debugging-port flag exposes Chrome's debugging endpoint at localhost:9222, although I'm not exactly sure how you would view it in this environment.
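
On viewing localhost:9222: with --remote-debugging-port set, Chrome serves a small HTTP API on that port, and /json/version returns browser metadata as JSON. A sketch for checking it from inside the same container while the driver is still running (the function names are illustrative, and whether the port is reachable from anywhere else depends on your network setup):

```python
import json
import urllib.error
import urllib.request


def devtools_url(port=9222, host="localhost", path="/json/version"):
    """Build the URL of a DevTools HTTP endpoint."""
    return f"http://{host}:{port}{path}"


def fetch_devtools_info(port=9222, host="localhost", timeout=2.0):
    """Return the parsed /json/version payload, or None if nothing is listening."""
    try:
        with urllib.request.urlopen(devtools_url(port, host), timeout=timeout) as response:
            return json.load(response)
    except (urllib.error.URLError, OSError):
        return None
```

Call fetch_devtools_info() before the finally block in main.py runs, since the endpoint disappears once the driver quits.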