apache-kafka apache-kafka-streams apache-kafka-connect

Where to run Kafka stream processor?

I'm playing around with Apache Kafka a bit and have a functional multi-node cluster configured. I want to now introduce a Kafka Stream Processor. I'll just do something simple, but here's my question: Where do I run it? I know I can run it as a standalone jar on any machine, but is that the correct place to run it? Do I run it on a worker node? Can I run it via the distributed Kafka Connect worker API? I saw documentation that says multiple instances of the same processor will be aware of each other....how? Is that handled in the Java Kafka libraries behind the scenes?

Basically, how do I deploy a processor at scale? Presumably I wouldn't manually start 10 (or 100 or 1000) instances of the same processor.

Assume I am NOT using Kubernetes for this, please. Also assume I am using the community-only packages for the Confluent Platform.

Solution

Kafka Connect does not run Kafka Streams applications.

ksqlDB, on the other hand, offers an abstraction layer for Kafka Streams applications and offers an embedded Connect worker.

Otherwise, yes, you simply run the Kafka Streams JAR files, anywhere that has network access to your Kafka cluster. Ideally, not on the cluster itself as it'll be competing for RAM and disk space.

And none of the above require Confluent Platform.

how do I deploy a processor at scale? Presumably I wouldn't manually start 10 (or 100 or 1000) instances of the same processor.

Well, you can only have up-to the number of partitions for your processor's input topics active threads, which you control by num.stream.threads and number of Streams processes.

If you're not deploying into Kubernetes, then you can still use other options like Puppet, Ansible, Supervisor, Hashicorp Nomad's Java Driver, etc.