When I create a Kinesis Data Firehose stream, there are two options for the Source.
What are the advantages and disadvantages of these options?
They serve different purposes. But if your aim is only to ingest records for storing (and optionally transforming) in S3, Redshift or Elasticsearch, then the main difference is simplicity.
Direct PUT or other sources
Allows for direct "manual" ingestion of records into Firehose. To ingest, you or your application have to call put-record or put-record-batch.
These API calls are simple and straightforward to use, in the sense that you don't need to manage record partitioning: you just provide the Firehose stream name and the record(s) to be written. Nothing else is required.
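As a minimal sketch of what a direct PUT looks like from Python, assuming boto3 is used as the SDK: the helper names `to_firehose_records` and `put_batch`, the newline-delimited JSON encoding, and the stream name in the usage comment are illustrative choices, not part of the original answer. The client is passed in so the function can be exercised without AWS access.

```python
import json


def to_firehose_records(events):
    """Serialize each event as one line of JSON; Firehose only sees raw bytes."""
    return [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]


def put_batch(client, stream_name, events):
    """Send events with one PutRecordBatch call and return the ones that failed.

    `client` is expected to behave like boto3.client("firehose").
    A single call accepts at most 500 records, so chunk larger inputs first.
    """
    response = client.put_record_batch(
        DeliveryStreamName=stream_name,
        Records=to_firehose_records(events),
    )
    # Results come back in request order; an ErrorCode marks a rejected record.
    return [
        event
        for event, result in zip(events, response["RequestResponses"])
        if "ErrorCode" in result
    ]


# Usage (needs boto3 and AWS credentials; the stream name is hypothetical):
#   import boto3
#   failed = put_batch(boto3.client("firehose"), "my-delivery-stream", [{"id": 1}])
```

Note that there is no partition key anywhere in the call: the stream name and the payload are the only inputs.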
Also, Firehose is basically serverless, so you do not need to manage its scaling or provision its throughput. It's all done automatically for you.
However, Firehose is not completely "real-time". Because it buffers records until a size or time threshold is reached before delivering them, your records always arrive with some delay.
Kinesis Data Stream
If you front your Firehose with a Kinesis data stream, then you have to put records into the stream. For that you use put-record or put-records. If you look at these API calls, they are more complicated, as you have to manage partition keys yourself. You have to do it correctly, as otherwise you end up with hot and cold shards and have to worry about how to fix that.
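The sketch below illustrates the extra work, again assuming boto3 as the SDK. The helper names (`shard_index`, `build_records`, `put_events`) are hypothetical, and `shard_index` assumes the shards split the hash key space evenly, which is only true for a stream that has not been resharded unevenly. Kinesis maps the MD5 hash of the partition key to a 128-bit integer to pick a shard, which is why a low-cardinality key produces hot shards.

```python
import hashlib
import json
import uuid


def shard_index(partition_key, shard_count):
    """Estimate the shard for a key, ASSUMING evenly split hash key ranges."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return hash_key * shard_count // 2**128


def build_records(events, key_field=None):
    """Attach a partition key to every event.

    With key_field, events sharing that value land on the same shard (ordered,
    but a popular value makes the shard hot). Without it, a random key is used
    to spread the load.
    """
    records = []
    for event in events:
        key = str(event[key_field]) if key_field else uuid.uuid4().hex
        records.append(
            {"Data": json.dumps(event).encode("utf-8"), "PartitionKey": key}
        )
    return records


def put_events(client, stream_name, events, key_field=None):
    """Call PutRecords and return the records that were rejected.

    `client` is expected to behave like boto3.client("kinesis").
    """
    records = build_records(events, key_field)
    response = client.put_records(StreamName=stream_name, Records=records)
    # Results come back in request order; an ErrorCode marks a rejected record.
    return [
        record
        for record, result in zip(records, response["Records"])
        if "ErrorCode" in result
    ]
```

Compared with the Firehose call, every record now carries a `PartitionKey`, and choosing it is the caller's responsibility.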
Also, data streams (in provisioned capacity mode) are not serverless, in the sense that they do not autoscale. You have to manage their throughput yourself. This means that you have to calculate and provision the number of shards you require. If you get it wrong, you will either be throttled or pay for capacity you do not use.
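The shard calculation can be sketched roughly as follows. The function name `required_shards` is hypothetical, and the per-shard limits used (1 MB/s or 1,000 records/s for writes, 2 MB/s for shared-throughput reads) are the commonly documented values; check them against the current AWS documentation before relying on the result.

```python
import math

# Assumed per-shard limits; verify against the current AWS documentation.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def required_shards(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    by_write_size = math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD)
    by_write_count = math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD)
    by_read_size = math.ceil(read_mb_per_sec / READ_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_write_size, by_write_count, by_read_size)


# 3.5 MB/s in, 5,200 records/s, 7 MB/s read in total by all consumers:
# the record rate is the binding limit here.
print(required_shards(3.5, 5200, 7.0))  # 6
```

The estimate should be based on peak traffic, not the average, since a shard that exceeds its limit rejects writes until the rate drops.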
Conclusions
Choose direct PUT to Firehose if you only aim at storing (and optionally transforming) your records in the supported storage destinations.
Choose a Kinesis data stream in front of Firehose if you need not only to store your records, but also to do other things with them in real time. This is because the stream can have consumers other than Firehose that do require real-time data.