python selenium pyspark databricks azure-databricks

How to use Selenium in Databricks and accessing and moving downloaded files to mounted storage and keep Chrome and ChromeDriver versions in sync?

I've seen a couple of posts on using Selenium in Databricks using %shto install Chrome Drivers and Chrome. This works fine for me, but I had a lot of trouble when I needed to download a file. The file would download, but I could not find it in the filesystem in databricks. Even if I changed the download path when instatiating Chrome to a mounted folder on Azure Blob Storage, the file would not be placed there after downloading. There is also a problem of keeping the Chrome browser and ChromeDriver version in sync automatically without manually changing the version numbers.

Following links show people with the same problem but no clear answer:

https://forums.databricks.com/questions/19376/if-my-notebook-downloads-a-file-from-a-website-by.html

https://forums.databricks.com/questions/45388/selenium-in-databricks-with-add-experimental-optio.html

Is there a way to identify where the file gets downloaded in Azure Databricks when I do web automation using Selenium Python?

And some struggling with getting Selenium to run properly at all: https://forums.databricks.com/questions/14814/selenium-in-databricks.html

not in path error: https://webcache.googleusercontent.com/search?q=cache:NrvVKo4LLdIJ:https://stackoverflow.com/questions/57904372/cannot-get-selenium-webdriver-to-work-in-azure-databricks+&cd=5&hl=en&ct=clnk&gl=us

Is there a clear guide to use Selenium on Databricks and manage downloaded files? And how can I keep the Chrome browser and ChromeDriver versions in sync automatically?

Solution

Here is the guide to installing Selenium, Chrome, and ChromeDriver. This will also move a file after downloading via Selenium to your mounted storage. Each number should be in its own cell.

Install Selenium

%pip install selenium

Do your imports

import pickle as pkl
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

Download the latest ChromeDriver to the DBFS root storage /tmp/. The curl command will get the latest Chrome version and store in the version variable. Note the escape \ before the $.

%sh
version=`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/\${version}/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip

Unzip the file to a new folder in the DBFS root /tmp/. I tried to use non-root path and it does not work.

%sh
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/

Get the latest Chrome download and install it.

%sh
sudo curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add
sudo echo "deb https://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable

** Steps 3 - 5 can be combined into one command. You can also use the following to create a shell script and use it as an init file to configure for your clusters and is especially useful when using job clusters which use transient clusters because init scripts apply to all worker nodes rather than just the driver node. This also installs Selenium, allowing you to skip step 1. Just paste in one cell in a new notebook, run, then point your init script to dbfs:/init/init_selenium.sh. Now every time the cluster or transient cluster spins up, this will install Chrome, ChromeDriver, and Selenium on all worker nodes before your job begins to run.

%sh
# dbfs:/init/init_selenium.sh
cat > /dbfs/init/init_selenium.sh <<EOF
#!/bin/sh
echo Install Chrome and Chrome driver
version=`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/\${version}/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/
sudo curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add
sudo echo "deb https://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
pip install selenium
EOF
cat /dbfs/init/init_selenium.sh

Configure your storage account. Example is for Azure Blob Storage using ADLSGen2.

service_principal_id = "YOUR_SP_ID"
service_principle_key = "YOUR_SP_KEY"
tenant_id = "YOUR_TENANT_ID"
directory = "https://login.microsoftonline.com/" + tenant_id + "/oauth2/token"
configs = {"fs.azure.account.auth.type": "OAuth",
       "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
       "fs.azure.account.oauth2.client.id":  service_principal_id,
       "fs.azure.account.oauth2.client.secret": service_principle_key,
       "fs.azure.account.oauth2.client.endpoint": directory,
       "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

Configure your mounting location and mount.

mount_point = "/mnt/container-data/"
mount_point_main = "/dbfs/mnt/container-data/"
container = "container-data"
storage_account = "adlsgen2"
storage = "abfss://"+ container +"@"+ storage_account + ".dfs.core.windows.net"
utils_folder = mount_point + "utils/selenium/"
raw_folder = mount_point + "raw/"

if not any(mount_point in mount_info for mount_info in dbutils.fs.mounts()):
  dbutils.fs.mount(
    source = storage,
    mount_point = mount_point,
    extra_configs = configs)
  print(mount_point + " has been mounted.")
else:
  print(mount_point + " was already mounted.")
print(f"Utils folder: {utils_folder}")
print(f"Raw folder: {raw_folder}")

Create method for instantiating Chrome browser. I need to load in a cookies file that I have placed in my utils folder which points to mnt/container-data/utils/selenium. Make sure the arguments are the same (no sandbox, headless, disable-dev-shm-usage)

def init_chrome_browser(download_path, chrome_driver_path, cookies_path, url):
    """
    Instatiates a Chrome browser.

    Parameters
    ----------
    download_path : str
        The download path to place files downloaded from this browser session.
    chrome_driver_path : str
        The path of the chrome driver executable binary (.exe file).
    cookies_path : str
        The path of the cookie file to load in (.pkl file).
    url : str
        The URL address of the page to initially load.

    Returns
    -------
    Browser
        Returns the instantiated browser object.
    """
    
    options = Options()
    prefs = {'download.default_directory' : download_path}
    options.add_experimental_option('prefs', prefs)
    options.add_argument('--no-sandbox')
    options.add_argument('--headless')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--start-maximized')
    options.add_argument('window-size=2560,1440')
    print(f"{datetime.now()}    Launching Chrome...")
    browser = webdriver.Chrome(service=Service(chrome_driver_path), options=options)
    print(f"{datetime.now()}    Chrome launched.")
    browser.get(url)
    print(f"{datetime.now()}    Loading cookies...")
    cookies = pkl.load(open(cookies_path, "rb"))
    for cookie in cookies:
        browser.add_cookie(cookie)
    browser.get(url)
    print(f"{datetime.now()}    Cookies loaded.")
    print(f"{datetime.now()}    Browser ready to use.")
    return browser

Instatiate browser. Set the downloads location to the DBFS root file system /tmp/downloads. Make sure the cookies path has /dbfs in front so the full cookies path is like /dbfs/mnt/...

browser = init_chrome_browser(
    download_path="/tmp/downloads",
    chrome_driver_path="/tmp/chromedriver/chromedriver",
    cookies_path="/dbfs"+ utils_folder + "cookies.pkl",
    url="YOUR_URL"
)

Do your navigating and any downloads you need.
OPTIONAL: Examine your download location. In this example, I downloaded a CSV file and will search through the downloaded folder until I find that file format.

import os
import os.path
for root, directories, filenames in os.walk('/tmp'):
    print(root)
    if any(".csv" in s for s in filenames):
        print(filenames)
        break

Copy the file from DBFS root tmp to your mounted storage (/mnt/container-data/raw/). You can rename during this operation as well. You can only access root file system using file: prefix when using dbutils.

dbutils.fs.cp("file:/tmp/downloads/file1.csv", f"{raw_folder}file2.csv')