Search code examples
apache-sparkevent-handlingspark-streamingmicroservices

Spark streaming as event processing/handling solution (micro-services)


Spark batch processing is bringing a lot of value to our business since it's very easy to scale horizontally (we use AWS EMR with YARN).

However, a new challenge has arisen as our newest proprietary solution follows a micro-service architecture. So far, there are ~230 micro-services acting as a producer where the events are stored in Kafka (that means ~230 Kafka topics).

Although we've managed to validate the use of Spark Streaming as event processing to build the lastest state of the objects, am I right saying that I'll need one spark streaming app per Kafka topic (so, ~230 apps)?

If so, our cluster with 48 vCPU and 192GiB of Memory can only handle 52 stream processing apps concurrently. That sounds way too little since those apps (which need to run 24 hours) don't do much as they simply pull events every 5 seconds and perform CRUD operations against our Data Store.

Am I miss using Spark streaming? What other approach or framework would you take/use?


Solution

  • That doesn't sound right, you don't need 230 topics for your microservices and you don't need 230 spark streaming apps, you will however use 1 task per partition which means you'll need 230*(partitions per topic) cores to run the 230 or 1 app you decide to build, note it depends on the traffic but your best choice may be to have only 1 topic or a smaller set of topics, filter on consumption. You can subscribe to any amount of topics. As far as what to use for building state stores, you can look at kafka streams or akka streams. I wouldn't recommend spark streaming for production applications at all (this statement qualifies as opinionated). Akka streams is the easiest API to use IMO, you may need to code your store and API on top of it.

    edit: use fs2-kafka or zio-kafka