Tags: amazon-web-services, amazon-s3, amazon-redshift, amazon-kinesis-firehose

How to load data into Redshift from a custom REST API


I am new to AWS, so please forgive me if this question has been asked before.

I have a REST API that returns two fields (name, email). I want to load this data into Redshift.

I thought of making a Lambda function that runs every 2 minutes and calls the REST API. The API might return at most 3-4 records within those 2 minutes.

So, in this situation, is it okay to just do an INSERT operation, or do I still have to use COPY (via S3)? I am only worried about performance and robust, error-free data inserts.

Also, the Lambda function will start asynchronously every 2 minutes, so insert operations might overlap (but there won't be an overlap in data).

In that case, if I go with the S3 option, I am worried that the S3 file generated by the previous Lambda invocation will be overwritten, causing a conflict.

Long story short, what is the best practice for inserting a small number of records into Redshift?

PS: I am okay with using other AWS components as well. I even looked into Firehose, which would be perfect for me, but it can't load data into a Redshift cluster in a private subnet.

Thanks all in advance


Solution

  • Yes, it would be fine to INSERT small amounts of data.

    The recommendation to always load via a COPY command is meant for large amounts of data, because COPY loads are parallelized across multiple nodes. However, for just a few rows, you can use INSERT without feeling guilty (see the sketch below).

    If your SORTKEY is a timestamp and you are loading data in time order, there is also less need to perform a VACUUM, since the data is already sorted. However, it is still good practice to VACUUM the table regularly if rows are being deleted.
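
    Here is a minimal sketch of such a Lambda handler. It assumes a hypothetical users (name, email) table, that the API URL and Redshift credentials come from environment variables, and that psycopg2 is packaged with the function (for example as a Lambda layer); redshift_connector would work the same way. All of these names are placeholders for your own setup.

        import json
        import os
        import urllib.request

        import psycopg2  # assumed to be bundled as a Lambda layer

        API_URL = os.environ["API_URL"]  # placeholder, e.g. https://example.com/api/users

        def lambda_handler(event, context):
            # Fetch the handful of new records from the REST API,
            # expected as a JSON list like [{"name": ..., "email": ...}, ...].
            with urllib.request.urlopen(API_URL) as resp:
                records = json.loads(resp.read())

            if not records:
                return {"inserted": 0}

            conn = psycopg2.connect(
                host=os.environ["REDSHIFT_HOST"],
                port=5439,
                dbname=os.environ["REDSHIFT_DB"],
                user=os.environ["REDSHIFT_USER"],
                password=os.environ["REDSHIFT_PASSWORD"],
            )
            try:
                with conn.cursor() as cur:
                    # Build one multi-row INSERT with safely escaped values.
                    values = b",".join(
                        cur.mogrify("(%s, %s)", (r["name"], r["email"]))
                        for r in records
                    )
                    cur.execute(b"INSERT INTO users (name, email) VALUES " + values)
                conn.commit()
            finally:
                conn.close()

            return {"inserted": len(records)}

    Batching every record from one invocation into a single multi-row INSERT keeps the statement and commit count low, which matters on Redshift because each commit is comparatively expensive. It also means overlapping invocations each issue just one atomic statement.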