python-3.x amazon-web-services docker kubernetes selenium-chromedriver

Running Selenium on a server such as AWS Lightsail for an ETL pipeline (MageAI)


Quick introduction: MageAI is an ETL pipeline tool similar to Airflow. I'm using MageAI to run daily cron jobs that crawl websites with Selenium. As of now, MageAI can be installed via docker-compose, Kubernetes, or pip install. When I install it via docker-compose or Kubernetes, the Selenium driver will not start. But when I use pip install, Selenium works.

Scenario where it works via pip install: I'm running an Ubuntu instance on AWS Lightsail. On this server, I ran pip3 install mage-ai selenium. This works, and the Selenium driver was able to start. The problem with this approach is that, since it runs directly on my server as a Python module, it can be unstable, as the server may go down.

Ideal scenario: If I implemented this via docker-compose or even Kubernetes, it would be more stable and scalable. But every approach I've tried so far results in the same error: the Selenium Chrome driver fails to start.

What I've done so far:

  1. I tried extending the MageAI Docker image and installing all the dependencies needed to run Selenium and Mage. It gave me the same error.
  2. I tried extending an Ubuntu Docker image and installing MageAI and Selenium via pip3 install. It still didn't work.

Does anyone know how to successfully run Selenium via docker-compose or even Kubernetes?


Solution

  • version: '3'
    services:
      magic:
        image: mageai/mageai:latest
        # <mage-ai stuff here>
    
    
      selenium:
        image: selenium/standalone-chrome:latest
        environment:
          - SE_NODE_OVERRIDE_MAX_SESSIONS=true
          - SE_NODE_MAX_SESSIONS=10
          - SE_NODE_GRID_URL=https://0.0.0.0:4444
        ports:
          - 4444:4444
    

    Use the Selenium Docker image and run it as a separate service alongside MageAI.
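Inside the Compose network, the MageAI container can normally reach the other container by its service name (selenium in the file above) instead of an IP address. Container start order does not guarantee that the browser node is ready, so a crawl block may want to poll the server's /status endpoint first. A minimal sketch, assuming the service name and port from the file above; hub_url, is_ready, and wait_for_grid are hypothetical helper names, not part of MageAI or Selenium:

```python
import json
import time
import urllib.request


def hub_url(host: str = "selenium", port: int = 4444) -> str:
    """Build the Remote WebDriver endpoint; 'selenium' is the assumed service name."""
    return f"http://{host}:{port}/wd/hub"


def is_ready(status: dict) -> bool:
    """Interpret the JSON body of GET /status, which reports value.ready."""
    return bool(status.get("value", {}).get("ready", False))


def wait_for_grid(base_url: str = "http://selenium:4444",
                  timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll /status until the server reports ready or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/status", timeout=5) as resp:
                if is_ready(json.load(resp)):
                    return True
        except (OSError, ValueError):
            # Connection refused or timed out, or the reply was not JSON yet.
            pass
        time.sleep(interval)
    return False
```

If wait_for_grid() returns False, it is probably better to fail the pipeline run early than to let webdriver.Remote raise a less obvious connection error.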

    Call within Python like so:

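    from selenium import webdriver

    chrome_options = webdriver.ChromeOptions()
    selenium_server_url = "http://<ip>:4444/wd/hub"  # address of the selenium service

    driver = webdriver.Remote(command_executor=selenium_server_url, options=chrome_options)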
    

    Make sure to call this after crawling so the browser session is released and its slot is freed: driver.quit()
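One way to make sure quit() always runs, even when a crawl raises an exception, is to wrap the driver in a context manager. This is a sketch rather than anything MageAI provides: remote_chrome, fetch_title, and the factory parameter are hypothetical names, and the default URL assumes the Compose service is called selenium.

```python
from contextlib import contextmanager


def _default_factory(url):
    # Imported lazily so the module still loads where selenium is not installed.
    from selenium import webdriver

    return webdriver.Remote(command_executor=url, options=webdriver.ChromeOptions())


@contextmanager
def remote_chrome(url="http://selenium:4444/wd/hub", factory=_default_factory):
    """Yield a remote Chrome driver and always quit it afterwards."""
    driver = factory(url)
    try:
        yield driver
    finally:
        # Runs on success and on error, so the session slot is released either way.
        driver.quit()


def fetch_title(page_url, **kwargs):
    """Open a page in the remote browser and return its title."""
    with remote_chrome(**kwargs) as driver:
        driver.get(page_url)
        return driver.title
```

The factory argument exists only so the driver can be swapped for a stub in tests. In a pipeline block, fetch_title("https://example.com") would use the real remote browser.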