I have a Python program that does a lot of pandas and sklearn computation. It basically iterates over a DataFrame and performs calculations. The code uses the `map` function of the `multiprocessing` module, and it also uses some sklearn models with `n_jobs = -1`.
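Simplified, the structure is something like this (a minimal sketch; the column names and the model are just placeholders for my real computation):

```python
import multiprocessing as mp

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def process_chunk(chunk):
    # Placeholder for the real per-chunk work: each worker process runs
    # an sklearn model that itself uses all local cores via n_jobs=-1.
    model = RandomForestRegressor(n_estimators=50, n_jobs=-1)
    model.fit(chunk[["x"]], chunk["y"])
    return model.predict(chunk[["x"]]).mean()


if __name__ == "__main__":
    df = pd.DataFrame({"x": np.random.rand(100_000), "y": np.random.rand(100_000)})
    chunk_size = len(df) // mp.cpu_count() + 1
    chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
    with mp.Pool() as pool:                  # multiprocessing map over the chunks
        results = pool.map(process_chunk, chunks)
    print(sum(results) / len(results))
```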
It needs about 1 TB of RAM and 100 cores to run. Sadly, the biggest machine I can launch at cloud providers has roughly 16 cores and 100 GB of RAM.
Is there a simple way to adapt my Python script to run it on a cluster of machines (or something similar) to handle the computation? I don't want to rewrite everything in Spark if I don't have to.
You can take a look at Celery. The project focuses on solving exactly this kind of problem: the execution units, called tasks, are executed concurrently on one or more worker servers...
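A rough sketch of what that could look like for your case, assuming a Redis broker and a hypothetical `process_chunk` task that holds the pandas/sklearn work (the broker URL, file names, and task body are placeholders to adapt):

```python
# tasks.py -- run "celery -A tasks worker --loglevel=info" on each machine in the cluster
import pandas as pd
from celery import Celery

app = Celery(
    "tasks",
    broker="redis://broker-host:6379/0",   # assumption: a Redis broker is reachable by all workers
    backend="redis://broker-host:6379/0",  # result backend so the caller can collect return values
)

@app.task
def process_chunk(records):
    # records is a plain list of dicts so it serializes with the default JSON serializer
    chunk = pd.DataFrame(records)
    # ... your pandas/sklearn computation on this chunk goes here ...
    return chunk["x"].mean()               # placeholder result
```

On the driver side you split the DataFrame into chunks and fan them out to the workers:

```python
# driver.py -- send chunks to the workers and collect the results
import pandas as pd
from tasks import process_chunk

df = pd.read_csv("data.csv")               # placeholder for your real data
chunk_size = 10_000
async_results = [
    process_chunk.delay(df.iloc[i:i + chunk_size].to_dict("records"))
    for i in range(0, len(df), chunk_size)
]
results = [r.get() for r in async_results]  # blocks until the workers finish
```

Each machine in the cluster runs a worker process, and the broker hands out chunks, so the 100-core / 1 TB workload gets spread across several smaller machines instead of one big one.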