
I can't get Selenium Chrome to work in Docker with Python


I have a classic "it works on my machine" problem: a web scraper that runs successfully on my laptop fails with a persistent error whenever I try to run it in a container.

My minimal reproducible dockerized example consists of the following files:

requirements.txt:

selenium==4.23.1
pandas==2.2.2
pandas-gbq==0.22.0
tqdm==4.66.2

Dockerfile:

FROM selenium/standalone-chrome:latest

# Set the working directory in the container
WORKDIR /usr/src/app

# Copy your application files
COPY . .

# Install Python and pip
USER root
RUN apt-get update && apt-get install -y python3 python3-pip python3-venv

# Create a virtual environment
RUN python3 -m venv /usr/src/app/venv

# Activate the virtual environment and install dependencies
RUN . /usr/src/app/venv/bin/activate && \
    pip install --no-cache-dir -r requirements.txt

# Switch back to the selenium user
USER seluser

# Set the entrypoint to activate the venv and run your script
CMD ["/bin/bash", "-c", "source /usr/src/app/venv/bin/activate && python -m scrape_ev_files"]

scrape_ev_files.py (slimmed down to just what's needed to repro error):

import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service


def init_driver(local_download_path):
    os.makedirs(local_download_path, exist_ok=True)

    # Set Chrome Options    
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--remote-debugging-port=9222")

    prefs = {
        "download.default_directory": local_download_path,
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True
    }
    chrome_options.add_experimental_option("prefs", prefs)

    # Set up the driver
    service = Service()

    chrome_options = Options()
    driver = webdriver.Chrome(service=service, options=chrome_options)

    # Set download behavior
    driver.execute_cdp_cmd("Page.setDownloadBehavior", {
        "behavior": "allow",
        "downloadPath": local_download_path
    })

    return driver

if __name__ == "__main__":
    # PARAMS
    ELECTION = '2024 MARCH 5TH DEMOCRATIC PRIMARY'
    ORIGIN_URL = "https://earlyvoting.texas-election.com/Elections/getElectionDetails.do"
    CSV_DL_DIR = "downloaded_files"

    # initialize the driver
    driver = init_driver(local_download_path=CSV_DL_DIR)

Shell commands to reproduce the error:

docker build -t my_scraper .  # (no error)
docker run --rm -t my_scraper # (error)

The stack trace from the error is below. Any help would be much appreciated! I've tried many iterations of my requirements.txt and Dockerfile to fix this, but the error at this spot has been frustratingly persistent:

  File "/workspace/scrape_ev_files.py", line 110, in <module>
    driver = init_driver(local_download_path=CSV_DL_DIR)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/scrape_ev_files.py", line 47, in init_driver
    driver = webdriver.Chrome(service=service, options=chrome_options)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/chrome/webdriver.py", line 45, in __init__
    super().__init__(
  File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/chromium/webdriver.py", line 66, in __init__
    super().__init__(command_executor=executor, options=options)
  File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/remote/webdriver.py", line 212, in __init__
    self.start_session(capabilities)
  File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/remote/webdriver.py", line 299, in start_session
    response = self.execute(Command.NEW_SESSION, caps)["value"]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/remote/webdriver.py", line 354, in execute
    self.error_handler.check_response(response)
  File "/workspace/.venv/lib/python3.12/site-packages/selenium/webdriver/remote/errorhandler.py", line 229, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: Chrome failed to start: exited normally.
  (session not created: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

Solution

  • You overwrite the chrome_options variable with a fresh Options() just before passing it to webdriver.Chrome(), so none of your options are applied. In particular, --disable-dev-shm-usage (the option that solves this issue) is missing.

    Just remove the second chrome_options = Options(), the one immediately before the driver initialization.

    As a side note, consider using --headless=new instead of --headless: it behaves more like regular Chrome, and --headless will be deprecated in future versions.
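For illustration, here is a minimal sketch of init_driver with the options object created only once and --headless=new swapped in. The helper names chrome_arguments and download_prefs are hypothetical and only there to keep the example easy to check. Converting the download directory to an absolute path is my own assumption, since the question passes a relative path. The CDP call from the question is left out for brevity.

```python
import os


def chrome_arguments():
    """Chrome switches from the question, with --headless=new instead of --headless."""
    return [
        "--headless=new",
        "--no-sandbox",
        "--disable-dev-shm-usage",
        "--remote-debugging-port=9222",
    ]


def download_prefs(download_dir):
    """Download preferences from the question.

    Assumption: the directory is made absolute here; the question passes it as-is.
    """
    return {
        "download.default_directory": os.path.abspath(download_dir),
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True,
    }


def init_driver(local_download_path):
    # Imported inside the function so the two helpers above can be
    # inspected on a machine without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    os.makedirs(local_download_path, exist_ok=True)

    # Create the options object exactly once.
    chrome_options = Options()
    for argument in chrome_arguments():
        chrome_options.add_argument(argument)
    chrome_options.add_experimental_option("prefs", download_prefs(local_download_path))

    # No second Options() here: the configured object is passed straight through.
    return webdriver.Chrome(service=Service(), options=chrome_options)
```

Calling init_driver("downloaded_files") inside the container should then start Chrome with all four switches applied, provided the image ships a matching Chrome and ChromeDriver.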

    Edit

    The image you are using turns off Selenium Manager, which is why you get this warning. You can turn it back on by adding ENV SE_OFFLINE=false to the Dockerfile.

    The driver initialization sometimes hangs and raises TimeoutException: Message: timeout: Timed out receiving message from renderer: 600.000. This is probably due to too many JS commands. Add these options:

    chrome_options.add_argument('--dns-prefetch-disable')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--enable-cdp-events')
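If the initialization still fails intermittently with those options, one possible workaround (not part of the answer above, just a sketch) is to wrap the driver creation in a small retry loop. The name init_with_retry and its parameters are hypothetical; make_driver is any zero-argument callable that returns a driver.

```python
import time


def init_with_retry(make_driver, attempts=3, delay_seconds=2.0, retry_on=(Exception,)):
    """Call make_driver() until it succeeds or the attempts run out.

    Only exceptions listed in retry_on trigger another attempt; the last
    one is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return make_driver()
        except retry_on as error:
            last_error = error
            # Pause between attempts, but not after the final one.
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error
```

With Selenium installed, this could be called as init_with_retry(lambda: init_driver("downloaded_files"), retry_on=(TimeoutException,)), with TimeoutException imported from selenium.common.exceptions. Each attempt can still take as long as the underlying timeout, so this only helps when the hang is occasional.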