Search code examples
apache-kafkaapache-kafka-streams

Kafka Streams - Relation between "Stream threads" vs "Tasks" running on 1 C4.XLarge Machine


I have a Kafka Streams Topology which has 5 Processors & 1 Source. Source topic for this topology has 200 partitions. My understanding is 200 tasks get created to match # of partitions for input topic.

This Kafka Streams app is running on C4.XLarge & these 200 tasks run on single stream thread which means this streams thread should be using up all the CPU Cores (8) & memory.

I know Kafka streams parallelism/scalability is controlled by number of stream threads. I can increase the num.stream.threads to 10, but how would it improve the performance if all of them run on single EC2 instance ?. How would it differ from running all tasks on single stream thread which is on single EC2 instance ?.


Solution

  • If you have a 8 core machine, you might want to run 8 StreamsThreads.

    This Kafka Streams app is running on C4.XLarge & these 200 tasks run on single stream thread which means this streams thread should be using up all the CPU Cores (8) & memory.

    This does not sound correct. A single thread cannot utilize multiple cores. While configuring a single StreamThread implies that some more other background threads are started (consumer heartbeat thread; producer sender thread), it would assume that you cannot fully utilize all 8 cores with this setting.

    If 8 StreamsThreads do not fully utilize your 8 cores you might consider to configure 16 thread. However note, that all thread will share the same network and thus, if the network is the actually limiting factor, running more threads won't give you higher throughput (or higher CPU utilization). For this case, you need to scale out using multiple EC2 instances.

    Given that you have 200 tasks, you could conceptually run up to 200 StreamThreads but you probably own't need 200 threads.