amazon-web-services amazon-kinesis

Amazon Kinesis 1 MB size limit workaround


As stated in the AWS documentation:

The maximum size of the data payload of a record before base64-encoding is up to 1 MiB.

Because I need to process records that may be larger than 1 MB, this limit may be an issue.

Is there any workaround for this limit? And has anyone already implemented and used a proven solution? (I'd like to avoid "reinventing the wheel".)


Solution

  • You have two choices: break the payload into multiple records or save it outside the stream, for example in S3.

    For the first option, you can use PartitionKey and SequenceNumberForOrdering (doc). Assign a unique partition key (such as a UUID) to each source record, so that all of its chunks land on the same shard. If you need to break the source into sub-1 MB chunks, set SequenceNumberForOrdering for chunks 2..N to the sequence number returned for the previous chunk.

    Consumers must then examine the partition key of each retrieved record and reconstruct the original record if necessary. Note that they may need to buffer chunks for several source records at the same time.
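A minimal sketch of the chunking approach, under some assumptions that are not part of the answer above: `kinesis` is expected to behave like a boto3 Kinesis client (`boto3.client("kinesis")`), the function and class names are hypothetical, and each chunk carries a small made-up header (chunk index and total count) so the consumer can tell when a source record is complete. The 1,000,000-byte record size is an assumed safety margin below the limit, not a documented value.

```python
import struct
import uuid
from collections import defaultdict

# Hypothetical framing, not part of Kinesis: each chunk starts with its
# index and the total chunk count so a consumer knows when it has them all.
HEADER = struct.Struct(">II")

# Assumed safety margin below the 1 MiB record limit.
MAX_RECORD_BYTES = 1_000_000


def frame_chunks(payload, record_size=MAX_RECORD_BYTES):
    """Split payload into framed chunks of at most record_size bytes each."""
    body_size = record_size - HEADER.size
    if body_size <= 0:
        raise ValueError("record_size must exceed the header size")
    bodies = [payload[i:i + body_size]
              for i in range(0, len(payload), body_size)] or [b""]
    return [HEADER.pack(index, len(bodies)) + body
            for index, body in enumerate(bodies)]


def put_chunked(kinesis, stream_name, payload, record_size=MAX_RECORD_BYTES):
    """Write one source record as ordered chunks sharing a fresh partition key."""
    partition_key = str(uuid.uuid4())
    previous_sequence = None
    for chunk in frame_chunks(payload, record_size):
        request = {
            "StreamName": stream_name,
            "Data": chunk,
            "PartitionKey": partition_key,
        }
        if previous_sequence is not None:
            # Chunks 2..N: order this put after the previous chunk.
            request["SequenceNumberForOrdering"] = previous_sequence
        previous_sequence = kinesis.put_record(**request)["SequenceNumber"]
    return partition_key


class Reassembler:
    """Buffers chunks per partition key until a source record is complete."""

    def __init__(self):
        self._pending = defaultdict(dict)

    def add(self, partition_key, data):
        """Return the full payload once all chunks have arrived, else None."""
        index, total = HEADER.unpack_from(data)
        parts = self._pending[partition_key]
        parts[index] = data[HEADER.size:]
        if len(parts) < total:
            return None
        del self._pending[partition_key]
        return b"".join(parts[i] for i in range(total))
```

On the consumer side, one `Reassembler` per shard reader would be fed the partition key and data of every record it retrieves; `add` returns `None` until the last chunk of a source record shows up. A real consumer would probably also want to evict buffers for source records whose remaining chunks never arrive.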

    Externalizing the data will simplify both the producer and consumer code. Again, create a unique identifier for each source record, but rather than writing the record to the stream, write it to S3 with that identifier as its key. Then write the key to the stream. The consumer will then retrieve the actual data from S3 when it reads the key from the stream.

    This second approach does require more management: you will need to add a lifecycle rule to S3 to delete the objects, and you'll need to ensure that this lifecycle rule lets the objects live at least as long as the stream's retention period (I would probably set an 8-day TTL regardless of stream retention period, because S3 is cheap).
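The S3 variant could look roughly like the sketch below. It assumes `s3` and `kinesis` behave like boto3 clients (`boto3.client("s3")`, `boto3.client("kinesis")`); the function names, the rule ID, and the choice to reuse the object key as the partition key are illustrative assumptions, not something the answer prescribes.

```python
import uuid


def put_external(s3, kinesis, bucket, stream_name, payload):
    """Store the payload in S3, then write only its key to the stream."""
    object_key = str(uuid.uuid4())
    # Upload first so a consumer never reads a key that has no object yet.
    s3.put_object(Bucket=bucket, Key=object_key, Body=payload)
    kinesis.put_record(
        StreamName=stream_name,
        Data=object_key.encode("utf-8"),
        PartitionKey=object_key,  # assumption: any well-distributed key works
    )
    return object_key


def fetch_external(s3, bucket, record_data):
    """Resolve a stream record (an S3 key) to the payload it points at."""
    object_key = record_data.decode("utf-8")
    return s3.get_object(Bucket=bucket, Key=object_key)["Body"].read()


def expiration_rule(days=8):
    """Lifecycle configuration that expires every object after `days` days."""
    return {
        "Rules": [
            {
                "ID": "expire-stream-payloads",  # hypothetical rule name
                "Filter": {"Prefix": ""},        # empty prefix: whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_expiration(s3, bucket, days=8):
    """Install the expiration rule on the bucket."""
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=expiration_rule(days),
    )
```

`apply_expiration` would typically be run once when the bucket is set up. The rule here covers the whole bucket, so this sketch assumes a bucket dedicated to stream payloads; with a shared bucket you would want a key prefix in both the object keys and the rule's filter.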

    If you only have infrequent large records, and especially if you have lots of small records, then writing everything to S3 will be inefficient. In that case you can adopt a hybrid model, in which you write a data structure to the stream that contains either the actual data or a reference to external storage.
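One possible shape for that hybrid data structure is a small JSON envelope, sketched below. Everything specific here is an assumption made for illustration: the field names `inline` and `s3_key`, the 700,000-byte threshold (chosen to leave room for base64's roughly one-third size overhead), and the `store_large` / `load_large` callbacks, which stand in for whatever code uploads to and downloads from external storage.

```python
import base64
import json

# Assumed threshold: small enough that the base64-encoded envelope
# still fits under the 1 MiB record limit.
INLINE_LIMIT_BYTES = 700_000


def encode_envelope(payload, store_large, inline_limit=INLINE_LIMIT_BYTES):
    """Build the stream record: inline data if small, else a reference.

    store_large(payload) is a caller-supplied function that saves the
    payload externally (for example in S3) and returns its key.
    """
    if len(payload) <= inline_limit:
        envelope = {"inline": base64.b64encode(payload).decode("ascii")}
    else:
        envelope = {"s3_key": store_large(payload)}
    return json.dumps(envelope).encode("utf-8")


def decode_envelope(record_data, load_large):
    """Recover the payload from a stream record built by encode_envelope.

    load_large(key) is a caller-supplied function that fetches an
    externally stored payload by its key.
    """
    envelope = json.loads(record_data)
    if "inline" in envelope:
        return base64.b64decode(envelope["inline"])
    return load_large(envelope["s3_key"])
```

With this layout the consumer only touches external storage for the occasional oversized record, and the producer pays the base64 overhead only on records that were small to begin with. A binary format with a one-byte type tag would avoid that overhead at the cost of a little more code.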