I'm working on a project where I store IoT data in an S3 bucket, batching it with AWS Kinesis Data Firehose. A Lambda function runs on the delivery stream and converts the epoch-milliseconds time into a proper timestamp with date and time. Here is the relevant part of my sample JSON payload:
"time":"17-01-2023 10:49:09"
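For context, the transformation described above could look roughly like the sketch below. It assumes the incoming record carries the epoch-milliseconds value in the same `time` field, that the timestamp is rendered in UTC, and that each output record is newline-delimited JSON; none of that is confirmed by my actual function, so treat the names as illustrative.

```python
import base64
import json
from datetime import datetime, timezone

# Output pattern matching the sample payload: day-month-year hour:minute:second.
TIME_FORMAT = "%d-%m-%Y %H:%M:%S"


def format_epoch_ms(epoch_ms):
    """Render epoch milliseconds as a 'dd-mm-YYYY HH:MM:SS' string (UTC assumed)."""
    moment = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return moment.strftime(TIME_FORMAT)


def lambda_handler(event, context):
    """Firehose transformation handler: rewrite the 'time' field of each record."""
    output = []
    for record in event["records"]:
        # Firehose hands each record to the function base64-encoded.
        payload = json.loads(base64.b64decode(record["data"]))
        # Assumption: 'time' holds epoch milliseconds on the way in.
        payload["time"] = format_epoch_ms(int(payload["time"]))
        data = (json.dumps(payload) + "\n").encode("utf-8")
        output.append(
            {
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data).decode("utf-8"),
            }
        )
    return {"records": output}
```

The function returns every record with the same `recordId` it received and a `result` of `Ok`, which is what Firehose expects from a transformation Lambda.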
I now want to convert these files in S3 to Parquet and then process them with Apache Spark (PySpark). What is the best way to do this? Should I use Kinesis Data Firehose itself, which can convert the data to Parquet format, or should I go with AWS Glue jobs? Both services seem to do the same thing. What is the difference between them, and which approach should I follow?
Any help will be greatly appreciated.
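The best way is to use the native Parquet conversion that is built into Firehose.
Firehose has an option ("Convert record format": enable it) that converts records to Parquet or ORC format before delivering them to S3. The conversion reads its schema from a table in the AWS Glue Data Catalog, so you need to define that table first.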
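If you create the delivery stream from code rather than the console, the same option corresponds to the `DataFormatConversionConfiguration` block inside `ExtendedS3DestinationConfiguration`. Below is a minimal sketch that only builds that block; the helper name and the database, table, role, and region values are hypothetical placeholders, and the rest of the destination configuration (bucket, buffering, and so on) is left out.

```python
def parquet_conversion_config(database, table, role_arn, region):
    """Build the DataFormatConversionConfiguration block for a Firehose stream.

    Hypothetical helper: the arguments identify a Glue Data Catalog table
    that describes the JSON records, and an IAM role Firehose can use to read it.
    """
    return {
        "Enabled": True,
        "SchemaConfiguration": {
            "DatabaseName": database,
            "TableName": table,
            "RoleARN": role_arn,
            "Region": region,
            "VersionId": "LATEST",
        },
        # Incoming records are JSON, parsed with the OpenX JSON SerDe.
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        # Outgoing records are written as Snappy-compressed Parquet.
        "OutputFormatConfiguration": {
            "Serializer": {"ParquetSerDe": {"Compression": "SNAPPY"}}
        },
    }


# Placeholder values for illustration only.
conversion = parquet_conversion_config(
    database="iot_db",
    table="iot_events",
    role_arn="arn:aws:iam::123456789012:role/firehose-glue-access",
    region="us-east-1",
)
```

The resulting dictionary would be passed as `ExtendedS3DestinationConfiguration["DataFormatConversionConfiguration"]` in a boto3 `create_delivery_stream` call. It is worth checking that the `time` column type in the Glue table matches how the Lambda formats the value; declaring it as a string is the safe choice if the format is not one the deserializer parses as a timestamp.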