I have a long-running Python script that takes 1-2 hours to complete. It's running in a 4 GB container with 1 CPU.
The script fetches and processes data in a for loop. Something like the following:
for i in ENOUGH_API_CALLS_TO_TAKE_2_HOURS:
    data = fetch_data()
    process_data(data)
The 4 GB container crashes halfway through script execution due to lack of memory. There's no way any individual API call comes close to retrieving 4 GB of data, though.
After using tracemalloc to debug, I think Python is slowly eating memory on each API call without releasing it back to the OS, eventually crashing the process by exceeding the memory limit.
I've read threads that discuss using multiprocessing to ensure memory gets released when tasks complete. But here I only have 1 CPU so I don't have a second processor to work with.
Is there any other way to release memory back to the OS from inside my main thread?
Note: I've tried gc.collect() without any success.
Multiprocessing does not require you to have multiple physical or logical CPUs. If you look at the task manager on your PC, there are almost certainly more processes running than you have cores or threads.
In this case, your single processor can only actively work on one task at a time, but it can switch back and forth between processes with a little overhead. This may extend the overall runtime a bit, but it does solve the problem of memory not being released: when a child process exits, the OS reclaims all of the memory that process was using.
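As a rough sketch of what that could look like, the loop body can be moved into a single worker process that is replaced after each task. The fetch_data and process_data bodies here are hypothetical stand-ins for your real functions, and I'm assuming a Linux container (hence the explicit "fork" context; drop it to use your platform's default start method). The Pool arguments processes and maxtasksperchild are standard library parameters.

```python
import multiprocessing as mp


def fetch_data(i):
    # Hypothetical stand-in for the real API call.
    return list(range(i * 1000, (i + 1) * 1000))


def process_data(data):
    # Hypothetical stand-in for the real processing step.
    return sum(data)


def fetch_and_process(i):
    # One unit of work; runs inside the child process.
    data = fetch_data(i)
    return process_data(data)


def run_all(n_calls, tasks_per_worker=1):
    # Assumption: Linux, so the "fork" start method is available.
    ctx = mp.get_context("fork")
    # processes=1 keeps a single worker, which suits a 1-CPU container.
    # maxtasksperchild makes the pool replace the worker after that many
    # tasks, so whatever memory it accumulated goes back to the OS on exit.
    with ctx.Pool(processes=1, maxtasksperchild=tasks_per_worker) as pool:
        return list(pool.imap(fetch_and_process, range(n_calls)))


if __name__ == "__main__":
    print(run_all(5))
```

Only the small return value of each task crosses the process boundary, so the parent stays lean. Raising tasks_per_worker trades some memory growth for fewer process restarts.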
Have you verified that you are actually running out of memory (e.g. by checking the logs on your container, or watching the memory usage of your Python process in real time)? If you're not sure, it might be worth confirming that this is the problem before rewriting the code into something that will likely be slower (due to the overhead of spawning child processes).
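One way to watch this from inside the script is to log memory every few iterations. This is only a sketch: run_with_memory_log and peak_rss_mib are names I made up, and the kilobyte unit for ru_maxrss assumes Linux (macOS reports bytes). It records both what tracemalloc sees (memory held by Python objects) and the peak resident set size the OS has seen for the process.

```python
import resource
import tracemalloc


def peak_rss_mib():
    # Peak resident set size of this process.
    # Assumption: Linux, where ru_maxrss is reported in kilobytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def run_with_memory_log(work, n_iterations, log_every=1):
    # Calls work(i) for each iteration and records
    # (iteration, bytes currently traced by tracemalloc, peak RSS in MiB).
    tracemalloc.start()
    readings = []
    try:
        for i in range(n_iterations):
            work(i)
            if i % log_every == 0:
                current, _peak = tracemalloc.get_traced_memory()
                readings.append((i, current, peak_rss_mib()))
                print(f"iter={i} traced={current / 2**20:.1f} MiB "
                      f"peak_rss={readings[-1][2]:.1f} MiB")
    finally:
        tracemalloc.stop()
    return readings
```

If the traced number climbs steadily, something in your code is still holding references (a growing list, a cache, a session object), and fixing that may be simpler than adding processes. If the traced number stays flat while RSS keeps growing, the memory is being held below the Python object level, and a child process is the more likely fix.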