python-3.x selenium-webdriver web-scraping headless-browser

How to speed up data scraping with headless mode and Selenium WebDriver in Python

Hello, I have a simple Python script that opens a web page and automatically extracts data from it. It takes 5 seconds to run. I would like a faster script that runs almost instantly, or in 2 seconds at most.

Here is the script:

#!/usr/bin/python3
# -*- coding: utf-8 -*-

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import numpy as np

options = Options()
options.headless = True
options.add_argument("window-size=1400,800")
options.add_argument("--no-sandbox")
options.add_argument("--disable-gpu")
options.add_argument("start-maximized")
options.add_argument("enable-automation")
options.add_argument("--disable-infobars")
options.add_argument("--disable-dev-shm-usage")

url = 'https://www.coteur.com/match/cotes-barcelone-huesca-rid1163090.html'
driver = webdriver.Chrome(options=options)
driver.get(url)

odds = [my_elem.text for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, '//button[contains(@class, "btn btn-default btn-xs btncote")]')))]

columns = 3
rows = int(len(odds)/columns)
odds = [float(i) for i in odds]
odds = np.array(odds)
odds = odds.reshape(rows, columns)

print(odds, '\n')
                
# quit() closes every window and ends the session, so a separate close() is not needed
driver.quit()

Maybe you can help me improve this little script and save some precious seconds. Thanks.

Here is the output of the execution:

[[ 1.18  8.25 17.  ]
 [ 1.18  8.25 17.  ]
 [ 1.18  8.1  17.  ]
 [ 1.14  8.   17.  ]
 [ 1.16  8.75 18.  ]
 [ 1.2   7.25 10.  ]
 [ 1.14  7.75 16.  ]
 [ 1.17  8.   16.  ]
 [ 1.16  8.8  19.  ]
 [ 1.16  7.   12.  ]
 [ 1.13  8.5  18.5 ]] 


real    0m4,978s
user    0m1,342s
sys     0m0,573s

The script takes about 5 seconds to run. My goal is to reduce the execution time.

Solution

Your execution time might depend on several factors:

the machine you're running the code on
the bandwidth of your connection
how much data you're requesting
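
On the last factor, one option that is sometimes worth trying is Selenium's "eager" page load strategy, which makes driver.get() return once the DOM is ready instead of waiting for images and stylesheets. The sketch below assumes Selenium 4 and Chrome 109 or later (for the --headless=new flag); the names CHROME_ARGS, PAGE_LOAD_STRATEGY and build_options are made up for illustration, and whether this saves any time on this particular page is untested.

```python
# Chrome flags, partly carried over from the question.
CHROME_ARGS = [
    "--headless=new",          # assumption: Chrome 109 or later
    "--window-size=1400,800",
    "--disable-gpu",
]

# "eager" lets driver.get() return at DOMContentLoaded;
# the default, "normal", waits for the full load event.
PAGE_LOAD_STRATEGY = "eager"


def build_options():
    """Return Chrome options with the eager page load strategy set."""
    # Imported inside the function so this module loads even where
    # Selenium is not installed (assumption: Selenium 4 API).
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.page_load_strategy = PAGE_LOAD_STRATEGY
    for arg in CHROME_ARGS:
        options.add_argument(arg)
    return options
```

It would be used as webdriver.Chrome(options=build_options()). The WebDriverWait from the question should stay in place, because with an eager load the odds buttons may not be rendered yet when driver.get() returns.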

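Having said that, I ran your code with a timer around it and the wait timeout lowered from 10 to 2 seconds, and got an execution time of 2.31 seconds. The timeout is only an upper bound on the wait, so lowering it does not speed up a successful run; the difference most likely comes from the factors above.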

#!/usr/bin/python3
# -*- coding: utf-8 -*-
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import numpy as np

options = Options()
options.headless = True
options.add_argument("window-size=1400,800")
options.add_argument("--no-sandbox")
options.add_argument("--disable-gpu")
options.add_argument("start-maximized")
options.add_argument("enable-automation")
options.add_argument("--disable-infobars")
options.add_argument("--disable-dev-shm-usage")

# Time the whole run: browser startup, page load, and the wait for the odds buttons
t0 = time.monotonic()
driver = webdriver.Chrome(options=options)
driver.get('https://www.coteur.com/match/cotes-barcelone-huesca-rid1163090.html')
elements = WebDriverWait(
    driver,
    2,
).until(
    EC.visibility_of_all_elements_located(
        (By.XPATH, '//button[contains(@class, "btn btn-default btn-xs btncote")]')
    )
)

# Read the text while the session is still open, then arrange the odds three per row
odds = np.array([float(my_elem.text) for my_elem in elements])
odds = odds.reshape(-1, 3)
print(odds)
t1 = time.monotonic()
print(f"{t1-t0:.2f}")

# End the session so no headless Chrome process is left running
driver.quit()
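
The single t0/t1 measurement above gives only a total. To see which of the factors dominates, each step can be timed separately. Below is a minimal sketch using only the standard library; the helper names stage and report are made up for illustration, and the sleeps are stand-ins for the real Selenium calls named in the comments.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage(name, timings):
    """Record how long the enclosed block takes under timings[name]."""
    start = time.monotonic()
    try:
        yield
    finally:
        # Stored even if the block raises, e.g. on a wait timeout.
        timings[name] = time.monotonic() - start


def report(timings):
    """Format one line per stage, in insertion order, plus a total."""
    lines = [f"{name:<12}{seconds:6.2f} s" for name, seconds in timings.items()]
    lines.append(f"{'total':<12}{sum(timings.values()):6.2f} s")
    return "\n".join(lines)


if __name__ == "__main__":
    timings = {}
    # Stand-ins: replace each sleep with the call named in the comment.
    with stage("startup", timings):    # driver = webdriver.Chrome(options=options)
        time.sleep(0.02)
    with stage("page load", timings):  # driver.get(url)
        time.sleep(0.03)
    with stage("wait", timings):       # WebDriverWait(driver, 2).until(...)
        time.sleep(0.01)
    print(report(timings))
```

If browser startup turns out to be the largest share, the page itself is not the bottleneck, and keeping one driver alive across several requests would help more than tuning the wait.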