I'm working on building a redshift database by listening to events from from different sources and pump that data into a redshift cluster.
The idea is to use Kinesis firehose to pump data to redshift using COPY command. But I have a dilemma here: I wish to first query some information from redshift using a select query such as the one below:
select A, B, C from redshift__table where D='x' and E = 'y';
After getting the required information from redshift, I will combine that information with my event notification data and issue a request to kinesis. Kinesis will then do its job and issue the required COPY command.
Now my question is that is it a good idea to repeatedly query redshift like say every second since that is the expected time after which I will get event notifications?
Now let me describe an alternate scenario:
If I normalize my table and separate out some fields into a separate table then, I will have to perform fewer redshift queries with the normalized design (may be once every 30 seconds)
But the downside of this approach is that once I have the data into redshift, I will have to carry out table joins while performing real time analytics on my redshift data.
So I wish to know on a high level which approach would be better:
Have a single flat table but query it before issuing a request to kinesis on an event notification. There wont be any table joins while performing analytics.
Have 2 tables and query redshift less often. But perform a table join while displaying results using BI/analytical tools.
Which of these 2 do you think is a better option? Let us assume that I will use appropriate sort keys/distribution keys in either cases.
I'd definitely go with your second option, which involves JOINing with queries. That's what Amazon Redshift is good at doing (especially if you have your SORTKEY and DISTKEY set correctly).
Let the streaming data come into Redshift in the most efficient manner possible, then join when doing queries. You'll have a lot less queries that way.
Alternatively, you could run a regular job (eg hourly) to batch process the data into a wide table. It depends how quickly you'll need to query the data after loading.