I am trying to use a shared list to collect information scraped with Selenium so I can later export it or use it however I choose. For some reason it is giving me this error: NameError: name 'scrapedinfo' is not defined...
This is really strange to me because I declared the list global AND I used multiprocessing.Manager() to create it. I have double-checked my code many times and it is not a case-sensitivity error. I also tried passing the list into the functions as an argument, but that created other problems and did not work. Any help is greatly appreciated!
    from selenium import webdriver
    from multiprocessing import Pool, Manager

    def browser():
        driver = webdriver.Chrome()
        return driver

    def test_func(link):
        driver = browser()
        driver.get(link)

    def scrape_stuff(driver):
        #Scrape things
        scrapedinfo.append(#Scraped Stuff)

    def multip():
        manager = Manager()
        #Declare list here
        global scrapedinfo
        scrapedinfo = manager.list()
        links = ["https://stackoverflow.com/", "https://signup.microsoft.com/", "www.example.com"]
        chunks = [links[i::3] for i in range(3)]
        pool = Pool(processes=3)
        pool.map(test_func, chunks)
        print(scrapedinfo)

    multip()
On Windows, multiprocessing starts a new Python process and then pickles/unpickles a limited view of the parent's state for the child. Global variables that are not passed in the map call are not included, so scrapedinfo is never created in the child, and you get the error.
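A minimal sketch of that behaviour (names are illustrative, no Selenium needed): a global created inside the `__main__` guard exists in the parent but never in spawn-started workers, which is exactly the situation scrapedinfo is in.

```python
from multiprocessing import Pool, set_start_method

def worker(_):
    # A spawned child re-imports this module, and the __main__
    # guard below does not run there, so "shared" never exists.
    return "shared" in globals()

def demo():
    with Pool(processes=2) as pool:
        return pool.map(worker, range(2))

if __name__ == "__main__":
    set_start_method("spawn", force=True)  # the default on Windows
    shared = []          # exists only in the parent process
    print(demo())        # children report False: they never saw "shared"
```

Under the fork start method (the Linux default) the children would inherit the parent's memory and report True instead, which is why this bug often only shows up on Windows.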
One solution is to pass scrapedinfo in the map call. Hacking it down to a quick example:
    from multiprocessing import Pool, Manager

    def test_func(param):
        scrapedinfo, link = param
        scrapedinfo.append("i scraped stuff from " + str(link))

    def multip():
        manager = Manager()
        global scrapedinfo
        scrapedinfo = manager.list()
        links = ["https://stackoverflow.com/", "https://signup.microsoft.com/", "www.example.com"]
        chunks = [links[i::3] for i in range(3)]
        pool = Pool(processes=3)
        pool.map(test_func, list((scrapedinfo, chunk) for chunk in chunks))
        print(scrapedinfo)

    if __name__ == "__main__":
        multip()
But you are doing more work than you need to with the Manager. map passes each worker's return value back to the parent process (and handles the chunking for you). So you could do:
    from multiprocessing import Pool

    def test_func(link):
        return "i scraped stuff from " + link

    def multip():
        links = ["https://stackoverflow.com/", "https://signup.microsoft.com/", "www.example.com"]
        pool = Pool(processes=3)
        scrapedinfo = pool.map(test_func, links)
        print(scrapedinfo)

    if __name__ == "__main__":
        multip()
And you avoid the extra processing of a clunky list proxy.
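One last note for the Selenium case: opening a new browser for every link is expensive, and Pool's initializer argument lets each worker process build one resource up front and reuse it across calls. A minimal sketch of that pattern, using a plain dict as a hypothetical stand-in for webdriver.Chrome() so it runs anywhere:

```python
from multiprocessing import Pool

# Process-local resource; in the real code this would hold a
# webdriver.Chrome() instance (hypothetical stand-in here).
_session = None

def init_worker():
    global _session
    _session = {"id": "per-process driver"}  # e.g. webdriver.Chrome()

def scrape(link):
    # Reuse this process's session instead of opening one per link.
    return "scraped %s with %s" % (link, _session["id"])

def multip():
    links = ["https://stackoverflow.com/", "https://signup.microsoft.com/"]
    with Pool(processes=2, initializer=init_worker) as pool:
        return pool.map(scrape, links)

if __name__ == "__main__":
    print(multip())
```

Each worker runs init_worker once at startup, so map calls within that worker all share the same driver, and the scraped strings still come back through map's return value, as above.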