python python-multiprocessing

List of lists after multiprocessing not usable outside the context manager


I've optimized my code to use multiple cores using multiprocessing:

import pandas as pd
import requests
from multiprocessing import Process
from multiprocessing import Manager

url1 = "https://api.unverpackt-verband.de/map"
# url2: "https://api.unverpackt-verband.de/map/info/" + id
url2 = "https://api.unverpackt-verband.de/map/info/"


headers = {"Accept": "application/json, text/plain, */*",
               "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:103.0) Gecko/20100101 Firefox/103.0"
               }

def get_req(url, append_to_url=""):
    req = requests.get(url, headers=headers).json()

    return req

def get_dataildaten(id, managed_list):
    # build the link
    link = "https://api.unverpackt-verband.de/map/info/" + id
    # send the request
    detail_req = requests.get(link, headers=headers)
    # append to the shared list
    managed_list.append(pd.json_normalize(detail_req.json()).values.tolist()[0])




# create the base data df
df_basisdaten = pd.json_normalize(get_req(url1, append_to_url=""))\
                  .astype({"id": "string"})

print(df_basisdaten.head())
print(df_basisdaten.shape)

# create the detail data

ids = [x for x in df_basisdaten["id"]]
lst = []

# protect the entry point
if __name__ == "__main__":
    # create the manager
    with Manager() as manager:
        # create the shared list
        managed_lst = manager.list()
        # create many child processes
        processes = [Process(target=get_dataildaten,
                             args=(id, managed_lst)) for id in ids]
        # start all processes
        for process in processes:
            process.start()
        # wait until all processes have finished
        for process in processes:
            process.join()
        print(managed_lst) # I want to use this managed_lst to create a pandas df

I can print the list, but I can't use it outside of the context manager (the with Manager() block).

With a function, I could probably use a return to make the list available outside as well.

My question: What do I have to change so that I can also use managed_lst outside the with block under if?


Solution

  • The managed list is no longer valid once the manager closes. Copy any information you need from the manager before it closes. Managed lists are fairly expensive to access (every access is a round trip to the manager's server process), so copying to a regular list once the shared list is no longer needed is a good idea in any case.

    When posting questions on SO, it's good to remove irrelevant code and focus just on the problem. It's easier for us to debug, and it often helps solve the problem before even posting. Here is a smaller script that would have the same problem, except for the marked line where the list is copied.

    from multiprocessing import Process
    from multiprocessing import Manager
    import os
    
    def worker(managed_lst):
        managed_lst.append(os.getpid())
    
    if __name__ == "__main__":
        with Manager() as manager:
            managed_lst = manager.list()
            processes = [Process(target=worker, args=(managed_lst,))
                         for _ in range(4)]
            for p in processes:
                p.start()
            for p in processes:
                p.join()
            del processes
            print(type(managed_lst), managed_lst)
            managed_lst = managed_lst[:]  # <=== exception if this line removed
        print(type(managed_lst), managed_lst)
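
The question also mentions returning the list from a function. A minimal sketch of that variant follows, assuming the work is wrapped in a hypothetical helper named collect_rows and the HTTP request is replaced by a stand-in worker that appends [item, item * 2]. Neither name is from the original code. The slice copy happens inside the with block, so the function returns a plain list rather than the proxy.

```python
from multiprocessing import Manager, Process


def worker(item, managed_lst):
    # Stand-in for the real request (assumption): append one row per item.
    managed_lst.append([item, item * 2])


def collect_rows(items):
    """Run one process per item and return the rows as a regular list."""
    with Manager() as manager:
        managed_lst = manager.list()
        processes = [Process(target=worker, args=(item, managed_lst))
                     for item in items]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        # Copy while the manager is still running; slicing the proxy
        # returns an ordinary list that stays valid after the manager exits.
        return managed_lst[:]


if __name__ == "__main__":
    rows = collect_rows([1, 2, 3])
    # Rows arrive in whatever order the processes finish, so sort for display.
    print(type(rows), sorted(rows))
```

The returned value is a regular Python list, so it can be passed to pd.DataFrame(rows) after the manager has shut down.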