amazon-web-services amazon-s3 amazon-redshift amazon-kinesis-firehose

How to load data into Redshift from a custom REST API

I am new to AWS and please forgive me if this question is asked previously.

I have a REST API which returns 2 parameters (name, email). I want to load this data into Redshift.

I thought of making a Lambda function which starts every 2 minutes and call the REST API. The API might return max 3-4 records within this 2 minutes.

So, under this situation is it okay to just do a insert operation or I have to still use COPY (using S3)? I am worried only about performance and error-free (robust) data insert.

Also, the Lambda function will start asynchronously every 2 mins, so there might be a overlap of insert operation (but there won't be an overlap in data).

At this situation and if I go with S3 option, I am worried the S3 file generated by previous Lambda invoke will be overwritten and a conflict occurs.

Long story short, what is the best practise to insert fewer records into redshift?

PS: I am okay with using other AWS components as well. I even looked into Firehose which is perfect for me but it can't load data into Private Subnet Redshift.

Thanks all in advance

Solution

Yes, it would be fine to INSERT small amounts of data.

The recommendation to always load via a COPY command is for large amounts of data because COPY loads are parallelized across multiple nodes. However, for just a few lines, you can use INSERT without feeling guilty.

If your SORTKEY is a timestamp and you are loading data in time order, there is also less need to perform a VACUUM, since the data is already sorted. However, it is good practice to still VACUUM the table regularly if rows are being deleted.