I have a Python program that does a lot of pandas and sklearn computation. It basically iterates over a DataFrame and performs calculations. The code uses the `map` function of the `multiprocessing` module, and it also uses some sklearn models with `n_jobs = -1`.
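Simplified, the structure is something like this (a minimal sketch; the column names and the model are just placeholders for my real computation):

```python
import multiprocessing as mp

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def process_chunk(chunk):
    # Placeholder for the real per-chunk work: each worker process runs
    # an sklearn model that itself uses all local cores via n_jobs=-1.
    model = RandomForestRegressor(n_estimators=50, n_jobs=-1)
    model.fit(chunk[["x"]], chunk["y"])
    return model.predict(chunk[["x"]]).mean()


if __name__ == "__main__":
    df = pd.DataFrame({"x": np.random.rand(100_000), "y": np.random.rand(100_000)})
    chunk_size = len(df) // mp.cpu_count() + 1
    chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
    with mp.Pool() as pool:                  # multiprocessing map over the chunks
        results = pool.map(process_chunk, chunks)
    print(sum(results) / len(results))
```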
It needs about 1 TB of RAM and 100 cores to run. Sadly, the biggest machine I can launch at cloud providers has roughly 16 cores and 100 GB of RAM.
Is there a simple way to adapt my Python script to run it on a cluster of machines (or something similar) to handle the computation? I don't want to rewrite everything in Spark if I don't have to.
You can take a look at Celery. The project focuses on solving exactly this kind of problem: the execution units, called tasks, are executed concurrently on one or more worker servers...
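A rough sketch of what that could look like for your case, assuming a Redis broker and a hypothetical `process_chunk` task that holds the pandas/sklearn work (the broker URL, file names, and task body are placeholders to adapt):

```python
# tasks.py -- run "celery -A tasks worker --loglevel=info" on each machine in the cluster
import pandas as pd
from celery import Celery

app = Celery(
    "tasks",
    broker="redis://broker-host:6379/0",   # assumption: a Redis broker is reachable by all workers
    backend="redis://broker-host:6379/0",  # result backend so the caller can collect return values
)

@app.task
def process_chunk(records):
    # records is a plain list of dicts so it serializes with the default JSON serializer
    chunk = pd.DataFrame(records)
    # ... your pandas/sklearn computation on this chunk goes here ...
    return chunk["x"].mean()               # placeholder result
```

On the driver side you split the DataFrame into chunks and fan them out to the workers:

```python
# driver.py -- send chunks to the workers and collect the results
import pandas as pd
from tasks import process_chunk

df = pd.read_csv("data.csv")               # placeholder for your real data
chunk_size = 10_000
async_results = [
    process_chunk.delay(df.iloc[i:i + chunk_size].to_dict("records"))
    for i in range(0, len(df), chunk_size)
]
results = [r.get() for r in async_results]  # blocks until the workers finish
```

Each machine in the cluster runs a worker process, and the broker hands out chunks, so the 100-core / 1 TB workload gets spread across several smaller machines instead of one big one.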