
How to reduce scraping time with headless mode and Selenium WebDriver in Python


Hello, I have a simple Python script that opens a web page and automatically extracts data from it. It takes about 5 seconds to run. I would like a faster script that finishes almost instantly, or in 2 seconds at most.

Here is the script:

#!/usr/bin/python3
# -*- coding: utf-8 -*-

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import numpy as np

options = Options()
options.headless = True
options.add_argument("window-size=1400,800")
options.add_argument("--no-sandbox")
options.add_argument("--disable-gpu")
options.add_argument("start-maximized")
options.add_argument("enable-automation")
options.add_argument("--disable-infobars")
options.add_argument("--disable-dev-shm-usage")

url = 'https://www.coteur.com/match/cotes-barcelone-huesca-rid1163090.html'
driver = webdriver.Chrome(options=options)
driver.get(url)

odds = [my_elem.text for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, '//button[contains(@class, "btn btn-default btn-xs btncote")]')))]

columns = 3
rows = int(len(odds)/columns)
odds = [float(i) for i in odds]
odds = np.array(odds)
odds = odds.reshape(rows, columns)

print(odds, '\n')
                
driver.close()
driver.quit()

Maybe you can help me improve this little script to save a few precious seconds. Thanks.


Here is the output of the execution:

[[ 1.18  8.25 17.  ]
 [ 1.18  8.25 17.  ]
 [ 1.18  8.1  17.  ]
 [ 1.14  8.   17.  ]
 [ 1.16  8.75 18.  ]
 [ 1.2   7.25 10.  ]
 [ 1.14  7.75 16.  ]
 [ 1.17  8.   16.  ]
 [ 1.16  8.8  19.  ]
 [ 1.16  7.   12.  ]
 [ 1.13  8.5  18.5 ]] 


real    0m4,978s
user    0m1,342s
sys     0m0,573s

It takes 5 seconds to run. My goal is to reduce the execution time.


Solution

  • Your execution time might depend on several factors:

    • the machine you're running the code on
    • the bandwidth of your connection
    • how much data you're requesting

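    That said, I ran your code with a timer that starts just before the driver is created and stops once the result is printed (so Python start-up and the imports are not counted), and measured 2.31 seconds.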

    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    import time
    
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    import numpy as np
    
    options = Options()
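    # Pass the flag directly: the options.headless setter is deprecated in recent Selenium 4 releases.
    options.add_argument("--headless")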
    options.add_argument("window-size=1400,800")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-gpu")
    options.add_argument("start-maximized")
    options.add_argument("enable-automation")
    options.add_argument("--disable-infobars")
    options.add_argument("--disable-dev-shm-usage")
    
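    # Start the timer just before the browser is launched.
    t0 = time.monotonic()
    driver = webdriver.Chrome(options=options)
    try:
        driver.get('https://www.coteur.com/match/cotes-barcelone-huesca-rid1163090.html')
        # 2 is the maximum wait in seconds; until() returns as soon as the buttons are visible.
        elements = WebDriverWait(
            driver,
            2,
        ).until(
            EC.visibility_of_all_elements_located(
                (By.XPATH, '//button[contains(@class, "btn btn-default btn-xs btncote")]')
            )
        )

        # One row per bookmaker, three odds per row.
        odds = np.array([float(my_elem.text) for my_elem in elements])
        odds = odds.reshape(-1, 3)
        print(odds)
        t1 = time.monotonic()
        print(f"{t1-t0:.2f}")
    finally:
        # Always shut the browser down, even if the wait times out.
        driver.quit()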
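
To find out which of the factors listed above dominates on a given machine, each phase can be timed separately with the same time.monotonic() approach. This is a minimal sketch, not a measurement of the real page: the `timed` and `slowest` helpers are hypothetical names (they are not part of Selenium), and time.sleep() stands in for the browser calls so the example runs without a browser.

```python
import time
from contextlib import contextmanager

# Collected durations in seconds, keyed by phase label.
timings = {}


@contextmanager
def timed(label, store=timings):
    """Record how long the enclosed block takes under `label`."""
    start = time.monotonic()
    try:
        yield
    finally:
        store[label] = time.monotonic() - start


def slowest(store):
    """Return the label of the phase that took the longest."""
    return max(store, key=store.get)


# Stand-in workloads (assumption): replace each sleep with the real call.
with timed("driver startup"):
    time.sleep(0.01)   # e.g. webdriver.Chrome(options=options)
with timed("page load"):
    time.sleep(0.20)   # e.g. driver.get(url)
with timed("wait and parse"):
    time.sleep(0.01)   # e.g. WebDriverWait(...).until(...) plus the reshape

for label, seconds in timings.items():
    print(f"{label:15s} {seconds:.2f}s")
print("slowest phase:", slowest(timings))
```

In the real script, the three `with timed(...)` blocks would wrap the driver creation, the `driver.get(...)` call, and the `WebDriverWait(...).until(...)` call respectively. If driver start-up turns out to be the largest share, no change to the wait or the XPath will recover that time.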