In designing a stream processing pipeline, what cost might be incurred if I have many topics, each with at least one partition but potentially no data going into it?
As an example, with one consumer I could choose to have one "mega topic" which contains all of the data and has many partitions, or I could choose to split that data (by tenant, account, user, etc.) into many topics, each with a single partition by default. My worry about the second case is that there would be many topics/partitions which would see no data. So, does an unused partition cost anything, or is there no cost incurred by an unused topic?
First of all, in terms of cost there is no real difference between one fat topic with lots of partitions and many topics with a few partitions each. A topic is just a logical grouping of events. What Kafka cares about is the total number of partitions.
Secondly, having lots of partitions can lead to some problems:
Each partition maps to a directory in the broker's file system. Within that log directory, there are at least two files per log segment (one for the index and one for the actual data), so the number of open file handles grows with the number of partitions.
Brokers allocate a buffer the size of replica.fetch.max.bytes for each partition they replicate. If replica.fetch.max.bytes is set to 1 MiB, and you have 1000 partitions, about 1 GiB of RAM is required.
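The arithmetic above can be sketched as a quick back-of-the-envelope calculation. This is only a rough estimate that assumes one buffer of replica.fetch.max.bytes per replicated partition, as described above; the function name `replication_buffer_bytes` is made up for illustration and is not part of any Kafka API.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def replication_buffer_bytes(replicated_partitions: int,
                             replica_fetch_max_bytes: int) -> int:
    """Rough estimate: one fetch buffer per partition the broker replicates.

    Hypothetical helper for illustration only, not a Kafka API.
    """
    return replicated_partitions * replica_fetch_max_bytes


# Numbers from the example above: 1000 partitions, 1 MiB per buffer.
estimate = replication_buffer_bytes(1000, 1 * MIB)
print(f"{estimate / GIB:.2f} GiB")  # prints "0.98 GiB", i.e. about 1 GiB
```

Note that the cost depends only on the partition count, so empty partitions count just as much as busy ones in this estimate.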
If the broker acting as controller fails, ZooKeeper elects another broker as the controller. The newly elected controller then has to read the metadata for every partition from ZooKeeper during initialization.
For example, if there are 10,000 partitions in the Kafka cluster and initializing the metadata from ZooKeeper takes 2 ms per partition, this can add 20 more seconds to the unavailability window.
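To play with other cluster sizes, the same estimate can be written as a small function. This is a minimal sketch that assumes the metadata is read sequentially at a fixed per-partition cost, which is a simplification; the 2 ms figure is the illustrative value from the example above, not a measured one, and the function name is hypothetical.

```python
def extra_unavailability_seconds(partitions: int,
                                 ms_per_partition: float) -> float:
    """Estimate the time added to controller failover by metadata loading.

    Assumes a fixed per-partition cost and sequential reads (simplification).
    """
    return partitions * ms_per_partition / 1000.0


# Numbers from the example above: 10,000 partitions at 2 ms each.
print(extra_unavailability_seconds(10_000, 2))  # prints 20.0
```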
You may get more information from these links:
https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster/
https://docs.cloudera.com/documentation/kafka/latest/topics/kafka_performance.html