Kinesis Data Firehose source `Direct PUT` vs `Kinesis Data Stream`

When I create a Kinesis Data Firehose delivery stream, there are two options for the Source:

Direct PUT or other sources
Kinesis Data Stream

What are the advantages and disadvantages of these options?

Solution

They serve different purposes. But if your aim is only to ingest records for storing (and optionally transforming) in S3, Redshift or Elasticsearch, then the main difference is simplicity.

Direct PUT or other sources

Allows for direct "manual" ingestion of records into Firehose. To ingest, you or your application have to call put-record or put-record-batch.

These API calls are simple and straightforward to use, in the sense that you don't need to manage record partitioning: you just provide the Firehose name and the record(s) to be written. Nothing else is required.
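As a rough sketch of what that looks like from Python: the helper below batches events and hands them to the `put_record_batch` call of a boto3 Firehose client. The function name `put_to_firehose`, the JSON-lines encoding and the choice to only count failures (rather than retry them) are my own assumptions for illustration, not something Firehose prescribes.

```python
import json

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def put_to_firehose(client, stream_name, events):
    """Send events in batches of up to 500; return how many records failed.

    `client` is expected to behave like boto3.client("firehose").
    Note that there is no partition key anywhere: name + data is all it takes.
    """
    failed = 0
    for start in range(0, len(events), MAX_BATCH):
        chunk = events[start:start + MAX_BATCH]
        # Assumption: one JSON document per line, which is convenient in S3.
        records = [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in chunk]
        response = client.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=records,
        )
        # A real producer would resend the failed entries; here we only count them.
        failed += response["FailedPutCount"]
    return failed
```

With boto3 installed and credentials configured, this would be called as `put_to_firehose(boto3.client("firehose"), "my-delivery-stream", events)`, where `my-delivery-stream` is a made-up name.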

Also, Firehose is basically serverless, so you do not need to manage its scaling or provision its throughput. It's all done automatically for you.

However, Firehose is not completely "real-time". Because it buffers records until a buffer size or buffer interval is reached, your records are always delivered with some delay.
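The buffering is configurable, so you can trade delay against the number of objects written. A minimal sketch, assuming an S3 destination and a boto3 Firehose client: the helper names, the ARNs you pass in and the default values shown here are placeholders of mine. Firehose flushes the buffer when either threshold is hit, whichever comes first.

```python
def s3_destination_config(role_arn, bucket_arn, size_mb=5, interval_sec=300):
    """Build S3 destination settings with explicit buffering hints."""
    return {
        "RoleARN": role_arn,
        "BucketARN": bucket_arn,
        # Flush when size_mb of data OR interval_sec seconds have accumulated.
        "BufferingHints": {"SizeInMBs": size_mb, "IntervalInSeconds": interval_sec},
    }


def create_direct_put_stream(client, name, role_arn, bucket_arn, **buffering):
    """Create a Direct PUT delivery stream; `client` acts like boto3.client("firehose")."""
    return client.create_delivery_stream(
        DeliveryStreamName=name,
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration=s3_destination_config(
            role_arn, bucket_arn, **buffering
        ),
    )
```

A smaller `interval_sec` shortens the delay, but the records still arrive in batches rather than one by one.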

Kinesis Data Stream

If you front your Firehose with a Kinesis stream, then you have to ingest records into the stream. For that you use put-record and/or put-records. These API calls are more complicated, as you have to choose a partition key for every record yourself. You have to do it correctly, as otherwise you end up with hot/cold shards and have to worry about how to fix that.
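To make the hot/cold shard problem concrete: Kinesis hashes each partition key with MD5 to a 128-bit integer and routes the record to the shard whose hash key range contains it. The sketch below assumes the shards split the hash space evenly (which may not hold after resharding), and the names `shard_index`, `shard_load` and `put_to_stream` are mine. `put_to_stream` expects something that behaves like `boto3.client("kinesis")`.

```python
import hashlib
import json
from collections import Counter

HASH_SPACE = 2 ** 128  # size of the MD5 hash key space


def shard_index(partition_key, shard_count):
    """Index of the (assumed equal-width) hash range the key falls into."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") * shard_count // HASH_SPACE


def shard_load(partition_keys, shard_count):
    """Count how many records each shard would receive."""
    return Counter(shard_index(k, shard_count) for k in partition_keys)


def put_to_stream(client, stream_name, events, key_field):
    """Write up to 500 events with PutRecords; return the failed record count.

    Unlike the Firehose call, every record needs a PartitionKey.
    """
    records = [
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": str(e[key_field])}
        for e in events
    ]
    response = client.put_records(StreamName=stream_name, Records=records)
    return response["FailedRecordCount"]
```

If every record carries the same key (say, one big tenant), `shard_load` shows all traffic landing on a single shard no matter how many shards you provision, whereas high-cardinality keys spread out across them.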

Also, data streams are not serverless in the sense that they do not autoscale (at least in provisioned capacity mode). You have to manage their throughput yourself. This means that you have to calculate and provision the number of shards you require. If you under-provision, writes get throttled; if you over-provision, you pay for capacity you do not use.
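The shard calculation itself is simple arithmetic. A back-of-the-envelope sketch, assuming the per-shard limits of provisioned mode (1 MB/s or 1,000 records/s for writes, 2 MB/s for reads shared by all consumers without enhanced fan-out) and perfectly even key distribution. The function name and parameters are mine, and I treat 1 MB as 1 MiB here.

```python
import math

SHARD_WRITE_BYTES_PER_SEC = 1024 * 1024      # write limit per shard (bytes)
SHARD_WRITE_RECORDS_PER_SEC = 1000           # write limit per shard (records)
SHARD_READ_BYTES_PER_SEC = 2 * 1024 * 1024   # read limit per shard, shared


def required_shards(records_per_sec, avg_record_bytes, consumers=1):
    """Estimate the shard count for a peak load; take the tightest limit."""
    write_bytes = records_per_sec * avg_record_bytes
    by_write_bytes = math.ceil(write_bytes / SHARD_WRITE_BYTES_PER_SEC)
    by_write_count = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    # Every consumer reads the full stream, so read volume scales with consumers.
    by_read_bytes = math.ceil(write_bytes * consumers / SHARD_READ_BYTES_PER_SEC)
    return max(1, by_write_bytes, by_write_count, by_read_bytes)
```

In practice you would size for peak traffic plus some headroom, since a skewed partition key can saturate one shard long before the total capacity is reached.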

Conclusions

Choose Direct PUT to Firehose if you only aim at storing (and optionally transforming) your records in the supported storage destinations.

Choose a Kinesis data stream in front of Firehose if you need not only to store your records but also to do other things with them in real time. This is because the stream can have consumers other than Firehose, and those consumers can read the data in real time.