Tags: python, list, python-multiprocessing

Python Multiprocessing Manager - List Name Error?


I am trying to use a shared list that will collect information scraped with Selenium, so I can later export it or use it however I choose. For some reason it is giving me this error: NameError: name 'scrapedinfo' is not defined...

This is really strange to me because I declared the list global AND I used multiprocessing.Manager() to create the list. I have double-checked my code many times, and it is not a case-sensitivity error. I also tried to pass the list through the functions as a variable, but this created other problems and did not work. Any help is greatly appreciated!

from selenium import webdriver
from multiprocessing import Pool, Manager

def browser():  
    driver = webdriver.Chrome()
    return driver

def test_func(link):
    driver = browser()
    driver.get(link)

def scrape_stuff(driver):

    # Scrape things
    scrapedinfo.append("scraped stuff")  # placeholder for the scraped data

def multip():
    manager = Manager()

    #Declare list here

    global scrapedinfo
    scrapedinfo = manager.list()

    links = ["https://stackoverflow.com/", "https://signup.microsoft.com/", "www.example.com"]
    chunks = [links[i::3] for i in range(3)]
    pool = Pool(processes=3)
    pool.map(test_func, chunks)
    print(scrapedinfo)

multip()

Solution

  • On Windows, multiprocessing starts a new Python process for each worker and pickles/unpickles only a limited view of the parent's state for the child. Global variables that are not passed in the map call are not included, so scrapedinfo is never created in the child process and you get the NameError.
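
    You can reproduce the same failure without any scraping. In the minimal sketch below (the data variable is purely illustrative), the spawned child re-imports the module but never runs the __main__ block, so the global it needs is missing:

    from multiprocessing import Pool, set_start_method

    def worker(_):
        # Under "spawn" the child re-imports this module, but the
        # __main__ block below never runs there, so data is undefined.
        return data

    if __name__ == "__main__":
        set_start_method("spawn")  # the default start method on Windows
        data = "set in the parent only"
        with Pool(processes=2) as pool:
            print(pool.map(worker, range(2)))  # NameError raised in the workers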

    One solution is to pass scrapedinfo in the map call. Cutting the code down to a quick example:

    from multiprocessing import Pool, Manager

    def test_func(param):
        # Each worker receives a (proxy, chunk) tuple; the Manager list
        # proxy pickles cleanly, so the child can append through it.
        scrapedinfo, link = param
        scrapedinfo.append("i scraped stuff from " + str(link))

    def multip():
        manager = Manager()

        global scrapedinfo
        scrapedinfo = manager.list()

        links = ["https://stackoverflow.com/", "https://signup.microsoft.com/", "www.example.com"]
        chunks = [links[i::3] for i in range(3)]
        pool = Pool(processes=3)
        pool.map(test_func, [(scrapedinfo, chunk) for chunk in chunks])
        print(scrapedinfo)

    if __name__ == "__main__":
        multip()
    

    But you are doing more work than you need to with the Manager. map already passes each worker's return value back to the parent process (and handles chunking for you). So you could do:

    from multiprocessing import Pool

    def test_func(link):
        # Whatever the worker returns travels back to the parent
        # automatically; no shared list is needed.
        return "i scraped stuff from " + link

    def multip():
        links = ["https://stackoverflow.com/", "https://signup.microsoft.com/", "www.example.com"]
        pool = Pool(processes=3)
        scrapedinfo = pool.map(test_func, links)
        print(scrapedinfo)

    if __name__ == "__main__":
        multip()
    

    And avoid the extra processing of a clunky list proxy.
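
    Applied back to the original Selenium setup, the same return-value pattern might look like the sketch below. The scraping logic (here just driver.title) is a placeholder; the key point is that each worker creates and quits its own driver, since a live WebDriver holds a browser connection and cannot be shared between processes:

    from multiprocessing import Pool
    from selenium import webdriver

    def scrape(link):
        # Each worker owns its own browser; drivers must be created
        # inside the child process, never passed in from the parent.
        driver = webdriver.Chrome()
        try:
            driver.get(link)
            return driver.title  # placeholder for real scraping logic
        finally:
            driver.quit()

    if __name__ == "__main__":
        links = ["https://stackoverflow.com/", "https://signup.microsoft.com/"]
        with Pool(processes=2) as pool:
            scrapedinfo = pool.map(scrape, links)
        print(scrapedinfo)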