I have a Lambda function that uses Direct PUT to send JSON strings to a Firehose stream, which delivers batches of records to S3. I want these records delivered as compressed .gz files.
However, even though Destination settings > Compression for data records is set to GZIP for the stream, the files are delivered as plaintext, although they are still given a .gz extension. I can tell because a) I can download a file from S3 and it opens as text without any modification, and b) gzip -d ~/path/my_file.gz returns gzip: /path/my_file.gz: not in gzip format
Why would Firehose deliver the data uncompressed even though compression is enabled? Am I missing something?
Code:
Lambda:
import json
import boto3

firehose = boto3.client("firehose")

record = {'field_1': 'test'}  # dict/json
record_string = json.dumps(record) + '\n'  # Firehose expects ndjson
response = firehose.put_record(
    DeliveryStreamName=my_stream_name,
    Record={'Data': record_string}
)
Firehose (Terraform):
resource "aws_kinesis_firehose_delivery_stream" "my_firehose_stream" {
  name        = my_stream_name
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn            = my_role_arn
    bucket_arn          = my_bucket_arn
    prefix              = "my_prefix/!{partitionKeyFromQuery:extracted}/"
    error_output_prefix = "my_error_prefix/"
    buffering_size      = 64  # MB
    buffering_interval  = 900 # seconds
    compression_format  = "GZIP" # Compress as GZIP

    # Enabled to extract the partition key for dynamic partitioning
    processing_configuration {
      enabled = true
      processors {
        type = "MetadataExtraction"
        parameters {
          parameter_name  = "JsonParsingEngine"
          parameter_value = "JQ-1.6"
        }
        parameters {
          parameter_name  = "MetadataExtractionQuery"
          parameter_value = "{extracted:.extracted}"
        }
      }
    }

    dynamic_partitioning_configuration {
      enabled = true
    }
  }
}
If you are downloading the file through a web browser, the browser may be decompressing it automatically, because browsers know how to handle gzip-compressed responses.
To see what is actually stored, download the file with the AWS CLI and then check its contents.
You can also compare the file size shown in S3 with the size of the file on your local disk.
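As a rough way to run both checks from Python instead of by hand, the sketch below looks for the two-byte gzip signature (0x1f 0x8b) at the start of a file and reports its size. The helper names (is_gzip, inspect_local_file, inspect_s3_object) and the bucket and key arguments are hypothetical and not part of the original setup; the S3 helper assumes boto3 is installed and credentials are configured.

```python
import gzip
from pathlib import Path

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzip(data: bytes) -> bool:
    """Return True if the bytes begin with the gzip signature."""
    return data[:2] == GZIP_MAGIC


def inspect_local_file(path):
    """Report the on-disk size of a downloaded file and whether it is gzip."""
    raw = Path(path).read_bytes()
    info = {"size_on_disk": len(raw), "is_gzip": is_gzip(raw)}
    if info["is_gzip"]:
        # Size after decompression, for comparison with the stored size.
        info["decompressed_size"] = len(gzip.decompress(raw))
    return info


def inspect_s3_object(bucket, key):
    """Check the object as stored in S3, without any browser in between.

    Hypothetical helper: 'bucket' and 'key' are placeholders for your own values.
    """
    import boto3  # imported here so the local helpers work without boto3

    s3 = boto3.client("s3")
    head = s3.head_object(Bucket=bucket, Key=key)
    # Fetch only the first two bytes to test for the gzip signature.
    first_bytes = s3.get_object(Bucket=bucket, Key=key, Range="bytes=0-1")["Body"].read()
    return {
        "size_in_s3": head["ContentLength"],
        "content_encoding": head.get("ContentEncoding"),
        "is_gzip": is_gzip(first_bytes),
    }
```

If inspect_s3_object reports is_gzip as True while the copy you downloaded in the browser reports False with a larger size, the object is compressed in S3 and was most likely decompressed during the download.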