Tags: python, selenium, screen-scraping, joblib

How to reuse a selenium driver instance during parallel processing?


To scrape a pool of URLs, I am processing selenium in parallel with joblib. In this context, I am facing two challenges:

  • Challenge 1 is to speed up this process. At the moment, my code opens and closes a driver instance for every URL (ideally it would be one per process)
  • Challenge 2 is to get rid of the CPU-intensive while loop that I think I need in order to continue on empty results (I know that this is most likely wrong)

Pseudocode:

URL_list = [URL1, URL2, URL3, ..., URL100000]                 # List of URLs to be scraped

def scrape(URL):
    while True:                                               # Loop needed to use continue
        try:                                                  # Try scraping
            driver = webdriver.Firefox(executable_path=path)  # Set up driver
            website = driver.get(URL)                         # Get URL
            results = do_something(website)                   # Get results from URL content
            driver.close()                                    # Close worker
            if len(results) == 0:                             # If do_something() failed:
                continue                                      # THEN retry the URL
            else:                                             # If do_something() worked:
                save_results("results.csv")                   # THEN save results
                break                                         # Go to next worker/URL
        except Exception as e:                                # If something weird happens:
            save_exception(URL, e)                            # THEN save error message
            break                                             # Go to next worker/URL

Parallel(n_jobs=40)(delayed(scrape)(URL) for URL in URL_list) # Run in 40 processes

My understanding is that in order to re-use a driver instance across iterations, the # Set up driver line needs to be placed outside scrape(URL). However, anything outside scrape(URL) will not find its way into joblib's Parallel(n_jobs=40). This would imply that you can't reuse driver instances while scraping with joblib, which can't be true.

Q1: How to reuse driver instances during parallel processing in the above example?

Q2: How to get rid of the while-loop while maintaining functionality in the above-mentioned example?

Note: Flash and image loading is disabled in firefox_profile (code not shown)


Solution

  • 1) You should first create a bunch of drivers: one for each process, and pass an instance to each worker. I don't know how to pass drivers to a Parallel object directly, but you can use threading.current_thread().name as a key to identify drivers. To do that, use backend="threading". Now each thread will have its own driver.

    2) You don't need a loop at all. The Parallel object itself iterates over all your URLs (I hope I really understood your intention in using the loop). If you do still need retries on empty results, see the bounded-retry sketch after the code below.

    import threading
    from joblib import Parallel, delayed
    from selenium import webdriver

    def scrape(URL):
        try:
            # Reuse the driver already created for this thread
            driver = drivers[threading.current_thread().name]
        except KeyError:
            # First URL in this thread: create its driver once and cache it
            drivers[threading.current_thread().name] = webdriver.Firefox()
            driver = drivers[threading.current_thread().name]
        driver.get(URL)
        results = do_something(driver)
        if results:
            save_results("results.csv")

    drivers = {}
    Parallel(n_jobs=-1, backend="threading")(delayed(scrape)(URL) for URL in URL_list)
    for driver in drivers.values():             # Quit every driver once all URLs are done
        driver.quit()
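
    If you still want to retry URLs that come back empty, a bounded for loop is a cheaper replacement for the while True/continue pattern. A minimal sketch, reusing the per-thread drivers dict from above and the question's placeholder helpers do_something(), save_results() and save_exception() (max_attempts is an assumed parameter, not part of the original code):

    def scrape(URL, max_attempts=3):
        name = threading.current_thread().name
        if name not in drivers:                  # Same per-thread driver reuse as above
            drivers[name] = webdriver.Firefox()
        driver = drivers[name]
        for attempt in range(max_attempts):      # Bounded retries instead of while True
            try:
                driver.get(URL)
                results = do_something(driver)
            except Exception as e:
                save_exception(URL, e)           # Save the error and give up on this URL
                return
            if results:                          # Non-empty results: save and move on
                save_results("results.csv")
                return
        # Still empty after max_attempts tries: skip this URL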
    

    But I don't really think you gain anything from using more n_jobs than you have CPUs, so n_jobs=-1 is best (of course, I may be wrong; try it).
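
    An alternative to keying a dict by thread name, sketched here under the same assumptions (do_something() and save_results() are the placeholders from the question), is a queue.Queue used as a driver pool: each worker borrows a driver, uses it, and returns it, which also makes cleanup explicit.

    import queue
    from joblib import Parallel, delayed
    from selenium import webdriver

    N_WORKERS = 4
    pool = queue.Queue()
    for _ in range(N_WORKERS):                   # Pre-create one driver per worker thread
        pool.put(webdriver.Firefox())

    def scrape(URL):
        driver = pool.get()                      # Borrow a driver (blocks if none is free)
        try:
            driver.get(URL)
            results = do_something(driver)
            if results:
                save_results("results.csv")
        finally:
            pool.put(driver)                     # Always return the driver to the pool

    Parallel(n_jobs=N_WORKERS, backend="threading")(delayed(scrape)(URL) for URL in URL_list)

    while not pool.empty():                      # Quit every driver once all URLs are done
        pool.get().quit()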