I've optimized my code to use multiple cores using multiprocessing:
import pandas as pd
import requests
from multiprocessing import Process
from multiprocessing import Manager

url1 = "https://api.unverpackt-verband.de/map"
# url2: "https://api.unverpackt-verband.de/map/info/" + id
url2 = "https://api.unverpackt-verband.de/map/info/"
headers = {"Accept": "application/json, text/plain, */*",
           "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:103.0) Gecko/20100101 Firefox/103.0"
           }

def get_req(url, append_to_url=""):
    req = requests.get(url, headers=headers).json()
    return req

def get_dataildaten(id, managed_list):
    # build the link
    link = "https://api.unverpackt-verband.de/map/info/" + id
    # send the request
    detail_req = requests.get(link, headers=headers)
    # append to the shared list
    managed_list.append(pd.json_normalize(detail_req.json()).values.tolist()[0])

# create the base data df
df_basisdaten = pd.json_normalize(get_req(url1, append_to_url=""))\
    .astype({"id": "string"})
print(df_basisdaten.head())
print(df_basisdaten.shape)
# create the detail data
ids = [x for x in df_basisdaten["id"]]
lst = []

# protect the entry point
if __name__ == "__main__":
    # create the manager
    with Manager() as manager:
        # create the shared list
        managed_lst = manager.list()
        # create many child processes
        processes = [Process(target=get_dataildaten,
                             args=(id, managed_lst)) for id in ids]
        # start all processes
        for process in processes:
            process.start()
        # wait until all processes have finished
        for process in processes:
            process.join()
        print(managed_lst)  # I want to use this managed_lst to create a pandas df
I can print the list, but I can't use it outside the context of the manager.
With a function, one could presumably use a return to make the list available outside as well.
My question: What do I have to change so that I can also use managed_lst outside the indented block under if?
The managed list is no longer valid once the manager closes, so copy anything you need from it before leaving the with block. Managed lists are also fairly expensive to access, so copying to a regular list once the shared access is no longer needed is a good idea anyway.
When posting questions on SO, it's good to remove irrelevant code and focus on just the problem. That makes it easier for us to debug, and it often helps you solve the problem before you even post. Here is a smaller script that would have the same problem, except for the marked line where the list is copied.
from multiprocessing import Process
from multiprocessing import Manager
import os

def worker(managed_lst):
    managed_lst.append(os.getpid())

if __name__ == "__main__":
    with Manager() as manager:
        managed_lst = manager.list()
        for _ in range(4):
            processes = [Process(target=worker, args=(managed_lst,))
                         for _ in range(4)]
            for p in processes:
                p.start()
            for p in processes:
                p.join()
            del processes
        print(type(managed_lst), managed_lst)
        managed_lst = managed_lst[:]  # <=== exception if this line removed
    print(type(managed_lst), managed_lst)
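
To tie this back to the "function with a return" idea from the question, here is a minimal sketch of how the copy could be wrapped in a function. The names worker and collect are made up for illustration, and the worker just uppercases a string as a stand-in for the real per-id request, so treat this as one possible shape rather than the definitive way to do it.

```python
from multiprocessing import Manager, Process


def worker(item, managed_lst):
    # Hypothetical stand-in for the real per-id request:
    # append one row per item to the shared list.
    managed_lst.append([item, item.upper()])


def collect(items):
    """Run one process per item and return the rows as a plain list."""
    with Manager() as manager:
        managed_lst = manager.list()
        processes = [Process(target=worker, args=(item, managed_lst))
                     for item in items]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        # Slice while the manager is still alive; this yields a regular
        # list that stays usable after the with block has exited.
        return managed_lst[:]


if __name__ == "__main__":
    rows = collect(["a", "b", "c"])
    # The order of rows depends on which process finishes first.
    print(type(rows), sorted(rows))
```

Since rows is an ordinary list of lists at that point, it should be possible to hand it straight to pd.DataFrame(rows) outside the manager.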