Tags: python, pandas, apache-spark, multiprocessing, multicore

Multicore Python as an alternative to Spark


I have a Python program that does a lot of pandas and sklearn computation. It basically iterates over a dataframe and performs calculations. The code uses the map function of the multiprocessing module, and it also uses some sklearn models with n_jobs = -1.
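
Roughly, the pattern looks like the sketch below (process_chunk is a stand-in for the real per-chunk computation; the rest is just the split/map/concat structure I described):

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for the real pandas/sklearn computation on one chunk.
    return chunk.assign(result=chunk["value"] * 2)


if __name__ == "__main__":
    df = pd.DataFrame({"value": np.arange(1_000_000)})

    # Split the dataframe into one chunk per core.
    n = mp.cpu_count()
    bounds = np.linspace(0, len(df), n + 1, dtype=int)
    chunks = [df.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]

    # Fan the chunks out across the local cores with multiprocessing's map.
    with mp.Pool(processes=n) as pool:
        df = pd.concat(pool.map(process_chunk, chunks))
```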

It needs about 1 TB of RAM and 100 cores to run. Sadly, the biggest machine I can launch at a cloud provider has roughly 16 cores and 100 GB of RAM.

Is there a simple way to adapt my Python script to run on a cluster of machines, or something similar, in order to handle the computation?

I don't want to rewrite everything in Spark if I don't have to.


Solution

  • You can take a look at Celery.

    The project focuses on solving exactly this kind of problem.

    The execution units, called tasks, are executed concurrently on one or more worker servers... A minimal sketch of the setup follows below.
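
    To give an idea of the shape this takes, here is a minimal sketch; the Redis broker URL, the tasks module name, and the heavy_computation function are all assumptions, stand-ins for your own infrastructure and per-chunk work:

    ```python
    # tasks.py -- start a worker on every machine in the cluster with:
    #   celery -A tasks worker
    from celery import Celery

    # Assumed broker and result backend: a Redis instance reachable by all workers.
    app = Celery(
        "tasks",
        broker="redis://localhost:6379/0",
        backend="redis://localhost:6379/0",
    )


    @app.task
    def heavy_computation(rows):
        # Stand-in for the real pandas/sklearn work on one chunk of data.
        return sum(rows)
    ```

    From a driver script you would then dispatch one task per chunk and collect the results:

    ```python
    # driver.py -- fan chunks out to the cluster and gather the results
    from tasks import heavy_computation

    chunks = [[1, 2, 3], [4, 5, 6]]  # stand-in chunks; yours come from the dataframe

    async_results = [heavy_computation.delay(chunk) for chunk in chunks]
    results = [r.get(timeout=600) for r in async_results]
    ```

    Because each machine runs its own worker, the 100-core requirement gets spread across many smaller instances instead of needing one giant box.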