Tags: python, apache-spark, celery, distributed, jobs

For distributing calculation tasks, which is better: Celery or Spark?


Problem: the calculation task can be parallelized easily, but a real-time response is required.

There are two possible approaches (a sketch of the first follows below):

1. Using Celery: run the jobs in parallel, built from scratch.
2. Using Spark: run the jobs in parallel with the Spark framework.
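For approach 1, here is a minimal sketch of how a web handler might hand the calculation to Celery and block briefly for a near-real-time result. The broker/backend URLs, task name, and chunking scheme are assumptions for illustration, not part of the original question; it presumes a Redis instance and running Celery workers.

    # tasks.py -- a hypothetical Celery app; assumes a local Redis broker.
    from celery import Celery, group

    app = Celery("calc",
                 broker="redis://localhost:6379/0",
                 backend="redis://localhost:6379/1")

    @app.task
    def compute_chunk(chunk):
        # Placeholder for the parallelizable calculation on one chunk.
        return sum(x * x for x in chunk)

    # In the web handler: fan the work out, then wait for the results.
    def handle_request(data, n_workers=4):
        chunks = [data[i::n_workers] for i in range(n_workers)]
        job = group(compute_chunk.s(c) for c in chunks)()
        # get() blocks; the timeout keeps the response near real-time.
        partials = job.get(timeout=5)
        return sum(partials)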

I think Spark is better from a scalability perspective, but is it OK to use Spark as the backend of a web application?


Solution

  • Celery: a good technology for distributed streaming work, and it supports Python, which is itself strong for computation and easy to write. A streaming-style application built on Celery can support many features as well, and Celery adds little CPU overhead.

    Spark: supports several programming languages (Java, Scala, Python). It is not pure streaming; per the Spark documentation, it does micro-batch streaming.
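    A minimal PySpark sketch of that micro-batch model: records arriving within each 1-second batch interval are collected and processed together as one small batch. The socket source, host, and port are assumptions for illustration (e.g. fed by `nc -lk 9999`); it assumes a local Spark installation.

        # Hypothetical micro-batch demo using Spark's DStream API.
        from pyspark import SparkContext
        from pyspark.streaming import StreamingContext

        sc = SparkContext("local[2]", "MicroBatchDemo")
        ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

        lines = ssc.socketTextStream("localhost", 9999)
        counts = (lines.flatMap(lambda line: line.split())
                       .map(lambda word: (word, 1))
                       .reduceByKey(lambda a, b: a + b))
        counts.pprint()  # one output per micro-batch

        ssc.start()
        ssc.awaitTermination()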

    If your task can be fulfilled by streaming alone and you don't need SQL-like features, then Celery is the best fit. But if you need various features (SQL-style queries, aggregations) along with streaming, then Spark is the better choice. In that case, consider how many batches of data your application will generate per second, since the micro-batch interval puts a floor on response latency.