Tags: python, selenium, webdriver, web-crawler

Only install the webdriver once and use it across various functions and loops


I have a simple web crawler that I use in a loop to crawl information from YouTube videos, as shown below:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time

def Scrap(url):
    options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(executable_path=ChromeDriverManager().install(), options=options)
    driver.get(url)
    time.sleep(6)

    #I will do some operations with the page source here

    driver.close()

urls = ["https://www.youtube.com/watch?v=FWMIPukvdsQ", "https://www.youtube.com/watch?v=Ot4qdCs54ZE"]

for url in urls :
    Scrap(url)

Everything works fine, but it is annoying that the driver gets installed again for every URL (twice in this example). This significantly slows down the program when I crawl data from hundreds of websites, and it feels wasteful. I have tried two methods to install the driver only once and use it in various functions and loops.

Method 1: Manually assign the path:

def update_driver():
    driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())

Then the output includes the path of the installed driver, which I manually copy and assign to a variable so that the other crawler functions can use it.

Problem with Method 1: I have to copy and paste the path by hand. Is there any way to automate this? Maybe I can capture the output of the installation and filter the path out of it?
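One possible way to automate this, assuming webdriver_manager behaves the way its console output suggests: ChromeDriverManager().install() returns the path of the installed (or cached) driver as a string, so it can be captured in a variable directly instead of being copied by hand. A minimal sketch:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# install() downloads the driver on the first run (or reuses the cached copy)
# and returns its path as a string, so no copy-pasting is needed.
driver_path = ChromeDriverManager().install()

# Any crawler function can now build a browser session from the saved path.
driver = webdriver.Chrome(executable_path=driver_path)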

Method 2: Make driver a global variable. Problem with Method 2: it reports errors when the driver is used for more than one URL.


Solution

  • The Problem

    • You are creating the driver inside the Scrap function, which launches the browser as many times as len(urls).

    • You are closing the browser with driver.close() inside the Scrap function; this should be done once, after the loop.

  • The Solution

    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager
    import time

    # Create the driver once, outside the function, so the browser is launched a single time.
    driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())
    
    
    def Scrap(url):
        driver.get(url)
        time.sleep(1)
    
        # I will do some operations with the page source here
    
    
    urls = ["https://www.youtube.com/watch?v=FWMIPukvdsQ", "https://www.youtube.com/watch?v=Ot4qdCs54ZE"]
    
    for url in urls:
        Scrap(url)
    
    # Close the browser once, after all the URLs have been processed.
    driver.close()
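    As a follow-up note (not part of the answer above): once the driver is shared like this, driver.quit() is usually preferable to driver.close() at the end of the script, since quit() closes every window and shuts down the chromedriver session, while close() only closes the current window. A minimal sketch of the same pattern with a try/finally so the browser is cleaned up even if one of the URLs raises an error:

    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager
    import time

    # Create the driver once; the browser is launched a single time.
    driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())

    def Scrap(url):
        driver.get(url)
        time.sleep(1)
        # operations with the page source here

    urls = ["https://www.youtube.com/watch?v=FWMIPukvdsQ",
            "https://www.youtube.com/watch?v=Ot4qdCs54ZE"]

    try:
        for url in urls:
            Scrap(url)
    finally:
        # quit() closes every window and terminates the chromedriver process;
        # close() only closes the current window.
        driver.quit()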