apache-kafka · kafka-topic · kafka-partition

Splitting Kafka into separate topics or a single topic with multiple partitions


As usual, it's a bit confusing to see the benefits of one splitting approach over the other.

  1. I can't see the difference/pros-cons between having
    • Topic 1 -> P0 and Topic 2 -> P0
    • over Topic 1 -> P0, P1
      with a consumer pulling from 2 topics versus a single topic with 2 partitions, where P0 and P1 hold different event types or entities.

The only benefit I can see is that if another consumer needs only Topic 2's data, then it's easy to consume.

  2. Regarding topic auto-creation, are there any benefits to that approach, or will it get out of hand after some time?

Thanks


Solution

    1. I would say this decision depends on multiple factors:

      • Logic/Separation of Concerns: You can decide whether to use multiple topics or multiple partitions based on the logic you are trying to implement. Normally, you need distinct topics for distinct entities. For example, say you want to stream users and companies. It doesn't make much sense to create a single topic with two partitions where the first partition holds users and the second one holds companies. Also, splitting entities across the partitions of a single topic won't allow you to implement e.g. message ordering for users, which can only be achieved using keyed messages (messages with the same key are placed in the same partition); see the keyed-producer sketch after this list.

      • Host storage capabilities: A partition must fit in the storage of the host machine, while a topic can be distributed across the whole Kafka cluster by splitting it into multiple partitions. The Kafka docs can shed some more light on this:

        The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.

      • Throughput: If you have high throughput, it makes more sense to create a separate topic per entity and split each into multiple partitions so that multiple consumers can join the consumer group (see the consumer-group sketch after this list). Don't forget that the level of parallelism in Kafka is defined by the number of partitions (and obviously active consumers).

      • Retention Policy: Message retention in Kafka works at the partition/segment level, and you need to make sure that the partitioning you've made, in conjunction with the retention policy you've picked, will support your use case. Retention is configured per topic (e.g. retention.ms), as shown in the topic-creation sketch at the end of this answer.
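
    To illustrate the ordering point above, here is a minimal keyed-producer sketch using the Java client. The broker address (localhost:9092) and the topic name (users) are assumptions for the example; records that share a key always land in the same partition, so per-user ordering is preserved.

    ```java
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.Properties;

    public class KeyedUserProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Both records use the key "user-42", so they are written to the same
                // partition of the (hypothetical) "users" topic and keep their relative order.
                producer.send(new ProducerRecord<>("users", "user-42", "user created"));
                producer.send(new ProducerRecord<>("users", "user-42", "email updated"));
            }
        }
    }
    ```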
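
    And a consumer-group sketch for the parallelism point: consumers that share a group.id divide a topic's partitions among themselves, so the partition count caps how many consumers can do useful work. The group id and topic name below are again just example values; running several instances of this program against a multi-partition topic gives each instance a subset of the partitions.

    ```java
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class UserConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed local broker
            props.put("group.id", "user-processors");          // consumers sharing this id split the partitions
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("users"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d key=%s value=%s%n",
                                record.partition(), record.key(), record.value());
                    }
                }
            }
        }
    }
    ```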

    2. Coming to your second question, I am not sure what your requirement is or how it relates to the first one. When a producer attempts to write a message to a Kafka topic that does not exist, the topic will be created automatically if auto.create.topics.enable is set to true. Otherwise, the topic won't get created and your producer will fail.

      auto.create.topics.enable: Enable auto creation of topic on the server

    Again, this decision should depend on your requirements and the desired behaviour. Normally, auto.create.topics.enable should be set to false in production environments in order to mitigate any risks, and topics should instead be created explicitly, as sketched below.
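
    As a sketch of the explicit alternative to auto-creation, the Java AdminClient lets you create the topic up front with a deliberate partition count, replication factor and retention. The values below (topic "users", 3 partitions, replication factor 1, 7-day retention) are illustrative assumptions, not recommendations.

    ```java
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;

    public class CreateUsersTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

            try (AdminClient admin = AdminClient.create(props)) {
                // Explicit creation: pick partition count, replication factor and
                // retention up front instead of relying on auto.create.topics.enable.
                NewTopic users = new NewTopic("users", 3, (short) 1)
                        .configs(Map.of("retention.ms", "604800000")); // 7 days
                admin.createTopics(Collections.singletonList(users)).all().get();
            }
        }
    }
    ```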