I've seen a couple of posts on using Selenium in Databricks that use %sh to install ChromeDriver and Chrome. This works fine for me, but I had a lot of trouble when I needed to download a file. The file would download, but I could not find it in the Databricks filesystem. Even when I changed the download path to a mounted folder on Azure Blob Storage when instantiating Chrome, the file was not placed there after downloading. There is also the problem of keeping the Chrome browser and ChromeDriver versions in sync automatically, without manually changing the version numbers.
The following links show people with the same problem but no clear answer:
https://forums.databricks.com/questions/19376/if-my-notebook-downloads-a-file-from-a-website-by.html
And some struggling with getting Selenium to run properly at all: https://forums.databricks.com/questions/14814/selenium-in-databricks.html
Is there a clear guide to use Selenium on Databricks and manage downloaded files? And how can I keep the Chrome browser and ChromeDriver versions in sync automatically?
Here is a guide to installing Selenium, Chrome, and ChromeDriver. It also moves a file downloaded via Selenium to your mounted storage. Each numbered step should be in its own cell.
1. Install Selenium.
%pip install selenium
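2. Do your imports.
import pickle as pkl
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service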
3. Download the latest ChromeDriver to /tmp/. The curl command gets the latest ChromeDriver release version and stores it in the version variable. Note the escape \ before the $.
%sh
version=`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/\${version}/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip
4. Unzip the file to a new folder under /tmp/. I tried to use a non-root path and it did not work.
%sh
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/
5. Install the latest stable Chrome.
%sh
sudo curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
sudo echo "deb https://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
** Steps 3 - 5 can be combined into one command. You can also use the following to create a shell script and use it as an init script for your clusters. This is especially useful for job clusters, which are transient, because init scripts run on all nodes, workers included, rather than just the driver node. The script also installs Selenium, allowing you to skip step 1. Just paste it into one cell of a new notebook, run it, then point your cluster's init script setting to dbfs:/init/init_selenium.sh. Now every time the cluster or transient cluster spins up, this will install Chrome, ChromeDriver, and Selenium on all nodes before your job begins to run.
%sh
# Writes dbfs:/init/init_selenium.sh
# The backticks and the $ are escaped so that the version lookup runs when the
# init script executes, not when this cell writes the file.
mkdir -p /dbfs/init
cat > /dbfs/init/init_selenium.sh <<EOF
#!/bin/sh
echo Install Chrome and Chrome driver
version=\`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE\`
wget -N https://chromedriver.storage.googleapis.com/\${version}/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/
sudo curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
sudo echo "deb https://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
pip install selenium
EOF
cat /dbfs/init/init_selenium.sh
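Since the question asks about keeping Chrome and ChromeDriver in sync, it can help to verify the installed versions before launching a browser. The following is a minimal sketch, not part of the original guide: the helper names major_version, versions_in_sync, and read_version are hypothetical, and it assumes the two binaries print version strings of the form "Google Chrome 114.0.5735.90" and "ChromeDriver 114.0.5735.90 (...)".

```python
import re
import subprocess


def major_version(version_output):
    """Extract the major version from text like 'Google Chrome 114.0.5735.90'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"No version number found in: {version_output!r}")
    return int(match.group(1))


def versions_in_sync(chrome_output, driver_output):
    """Return True when Chrome and ChromeDriver share the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


def read_version(command):
    """Run a '--version' command and return its stdout (the binary must be installed)."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

On the cluster, one way to use it would be versions_in_sync(read_version(["google-chrome", "--version"]), read_version(["/tmp/chromedriver/chromedriver", "--version"])), failing fast if it returns False.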
6. Configure your storage account. This example uses a service principal with ADLS Gen2.
service_principal_id = "YOUR_SP_ID"
service_principal_key = "YOUR_SP_KEY"
tenant_id = "YOUR_TENANT_ID"
directory = "https://login.microsoftonline.com/" + tenant_id + "/oauth2/token"
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": service_principal_id,
           "fs.azure.account.oauth2.client.secret": service_principal_key,
           "fs.azure.account.oauth2.client.endpoint": directory,
           "fs.azure.createRemoteFileSystemDuringInitialization": "true"}
mount_point = "/mnt/container-data/"
mount_point_main = "/dbfs/mnt/container-data/"
container = "container-data"
storage_account = "adlsgen2"
storage = "abfss://" + container + "@" + storage_account + ".dfs.core.windows.net"
utils_folder = mount_point + "utils/selenium/"
raw_folder = mount_point + "raw/"
# Mount the container only if it is not mounted yet (trailing slashes are ignored in the comparison)
if not any(m.mountPoint.rstrip("/") == mount_point.rstrip("/") for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source=storage,
        mount_point=mount_point,
        extra_configs=configs)
    print(mount_point + " has been mounted.")
else:
    print(mount_point + " was already mounted.")
print(f"Utils folder: {utils_folder}")
print(f"Raw folder: {raw_folder}")
7. Define a function that instantiates a Chrome browser. It loads a cookies file, which I keep in my utils folder, which points to mnt/container-data/utils/selenium. Make sure the arguments are the same (no sandbox, headless, disable-dev-shm-usage).
def init_chrome_browser(download_path, chrome_driver_path, cookies_path, url):
    """
    Instantiates a Chrome browser.
    Parameters
    ----------
    download_path : str
        The download path to place files downloaded from this browser session.
    chrome_driver_path : str
        The path of the ChromeDriver executable binary.
    cookies_path : str
        The path of the cookie file to load in (.pkl file).
    url : str
        The URL address of the page to initially load.
    Returns
    -------
    Browser
        Returns the instantiated browser object.
    """
    options = Options()
    prefs = {'download.default_directory': download_path}
    options.add_experimental_option('prefs', prefs)
    options.add_argument('--no-sandbox')
    options.add_argument('--headless')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--start-maximized')
    options.add_argument('--window-size=2560,1440')
    print(f"{datetime.now()} Launching Chrome...")
    browser = webdriver.Chrome(service=Service(chrome_driver_path), options=options)
    print(f"{datetime.now()} Chrome launched.")
    # Load the page once first: cookies can only be added for the domain currently open
    browser.get(url)
    print(f"{datetime.now()} Loading cookies...")
    with open(cookies_path, "rb") as cookies_file:
        cookies = pkl.load(cookies_file)
    for cookie in cookies:
        browser.add_cookie(cookie)
    # Reload so the page is requested with the cookies applied
    browser.get(url)
    print(f"{datetime.now()} Cookies loaded.")
    print(f"{datetime.now()} Browser ready to use.")
    return browser
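The function above expects a cookies .pkl file to exist already, and the guide does not show how it was produced. One plausible way, sketched here as an assumption, is to pickle the result of get_cookies() from a Selenium session that is already logged in. The helper names save_cookies and load_cookies are hypothetical and not part of the original answer.

```python
import pickle as pkl


def save_cookies(browser, cookies_path):
    """Write the browser's current cookies (a list of dicts) to a pickle file."""
    with open(cookies_path, "wb") as cookies_file:
        pkl.dump(browser.get_cookies(), cookies_file)


def load_cookies(browser, cookies_path):
    """Add every cookie from the pickle file to the browser and return the count."""
    with open(cookies_path, "rb") as cookies_file:
        cookies = pkl.load(cookies_file)
    for cookie in cookies:
        browser.add_cookie(cookie)
    return len(cookies)
```

Only unpickle cookie files you created yourself, since loading a pickle from an untrusted source can execute arbitrary code.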
8. Instantiate the browser. Set the download location to /tmp/downloads. Make sure the cookies path has /dbfs in front, so the full cookies path looks like /dbfs/mnt/...
browser = init_chrome_browser(
    download_path="/tmp/downloads",
    chrome_driver_path="/tmp/chromedriver/chromedriver",
    cookies_path="/dbfs" + utils_folder + "cookies.pkl",
    url="YOUR_URL"
)
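The guide switches between three path forms: DBFS paths such as /mnt/... or dbfs:/init/..., the local /dbfs/... form needed by ordinary Python file access, and the file: form used with dbutils for the driver's local disk. The sketch below only illustrates those string conventions; the helper names to_local_path and to_file_uri are hypothetical.

```python
def to_local_path(dbfs_path):
    """Map '/mnt/x/f' or 'dbfs:/mnt/x/f' to the local form '/dbfs/mnt/x/f'."""
    if dbfs_path.startswith("dbfs:"):
        dbfs_path = dbfs_path[len("dbfs:"):]
    return "/dbfs/" + dbfs_path.lstrip("/")


def to_file_uri(local_path):
    """Map a driver-local path such as '/tmp/x' to 'file:/tmp/x' for use with dbutils.fs."""
    return "file:/" + local_path.lstrip("/")
```

For example, to_local_path(utils_folder + "cookies.pkl") gives the same string as the "/dbfs" + utils_folder + "cookies.pkl" concatenation used when instantiating the browser.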
9. Do your navigating and any downloads you need.
10. OPTIONAL: Examine your download location. In this example, I downloaded a CSV file and will search through the downloaded folder until I find that file format.
import os
import os.path
for root, directories, filenames in os.walk('/tmp'):
    print(root)
    if any(".csv" in s for s in filenames):
        print(filenames)
        break
11. Finally, copy the file from /tmp/downloads to your mounted storage (/mnt/container-data/raw/). You can rename the file during this operation as well. When using dbutils, you can only access the root file system by using the file: prefix.
dbutils.fs.cp("file:/tmp/downloads/file1.csv", f"{raw_folder}file2.csv")
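Copying too early can pick up a partial file, because Chrome writes an in-progress download under a temporary name ending in .crdownload and renames it when the download finishes. The following is a minimal sketch of a polling helper to call before the copy. The name wait_for_download and its timeout and poll_interval parameters are assumptions for illustration, not part of the original answer.

```python
import os
import time


def wait_for_download(directory, timeout=60.0, poll_interval=0.5):
    """Return the files in `directory` once no '.crdownload' files remain.

    Raises TimeoutError if a download is still in progress, or nothing has
    appeared, after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory) if os.path.isdir(directory) else []
        in_progress = [n for n in names if n.endswith(".crdownload")]
        if names and not in_progress:
            return sorted(os.path.join(directory, n) for n in names)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Download did not finish in {directory}")
        time.sleep(poll_interval)
```

Calling wait_for_download("/tmp/downloads") right after triggering the download returns the finished file paths, which can then be passed to dbutils.fs.cp with the file: prefix.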