Tags: memory, apache-kafka, kafka-consumer-api, kafka-producer-api

Does Kafka really need SSD disks?


We are a little confused about which disk types our Kafka machines need.

In our Kafka cluster in production, we have producers, 3 Kafka brokers, and consumers.

When producers push data to topics and consumers read data from topics, how can we avoid the situation where a consumer tries to read data from a topic partition but the data is not actually in the topic yet?

Second, since we do not use SSD disks in the Kafka brokers, how can we know whether a consumer reads the data from the memory cache or from the disks?


Solution

  • How can we avoid the situation where a consumer tries to read data from a topic partition but the data is not actually in the topic yet?

    Kafka reads data sequentially, so there is no random access. You cannot read arbitrary records directly; you can only specify an offset to start reading from. A consumer also cannot read data that is not in the partition yet: a fetch at the current offset simply returns nothing until new records arrive, as the sketch below illustrates.
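
    Here is a minimal consumer sketch for the first question. The broker address `localhost:9092`, the topic `example-topic`, and the group id `example-group` are placeholder assumptions, not details from the question. The point it demonstrates is that `poll()` only ever returns records that already exist at the consumer's current offset, so there is no "data not in the topic" case to avoid:

    ```java
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class OffsetReadExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder address
            props.put("group.id", "example-group");           // placeholder group id
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("example-topic"));

                // poll() returns only records that already exist at the current
                // offset. If nothing new has been produced, it returns an empty
                // batch; the consumer just polls again later.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n",
                                      record.offset(), record.value());
                }
            }
        }
    }
    ```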

    Also, because reads and writes are sequential rather than random, using SSDs has no significant effect on performance.

    From the Cloudera blog (link):

    Using SSDs instead of spinning disks has not been shown to provide a significant performance improvement for Kafka, for two main reasons:

    • Kafka writes to disk are asynchronous. That is, other than at startup/shutdown, no Kafka operation waits for a disk sync to complete; disk syncs are always in the background. That’s why replicating to at least three replicas is critical—because a single replica will lose the data that has not been sync’d to disk, if it crashes.

    • Each Kafka Partition is stored as a sequential write ahead log. Thus, disk reads and writes in Kafka are sequential, with very few random seeks. Sequential reads and writes are heavily optimized by modern operating systems.
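
    On the durability point above (disk writes are asynchronous, so replication is what protects data that has not yet been synced), the producer can require acknowledgement from all in-sync replicas before a send counts as successful. This is a minimal sketch, not the questioner's setup; the broker address and topic name are placeholders, and on a 3-broker cluster it would typically be paired with a topic/broker setting such as `min.insync.replicas=2`:

    ```java
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DurableProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder address
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            // acks=all: the send is acknowledged only after all in-sync
            // replicas have the record, so a single crashed broker whose
            // copy was never synced to disk does not lose the data.
            props.put("acks", "all");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("example-topic", "key", "value"));
                producer.flush(); // block until buffered records are sent
            }
        }
    }
    ```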