I am new to Redshift and the AWS ecosystem, and I am looking for options or best practices for streaming data changes from on-premises Oracle EBS tables to Redshift.
There is a large number of tables to stream from Oracle. I am familiar with Oracle, SQL Server, and PostgreSQL, but not Redshift, so sorry if I am totally off base here.
Please help :) Any thoughts and/or references would be highly appreciated...
As per the docs here, options 1 and 2 are essentially the same. You won't bypass S3 by using Firehose, you just hide it. And Firehose is currently of little use if you have lots of tables on more than one cluster, unless of course you plan to automate the process of sending support requests to AWS Support to raise the limits (I was thinking about it, don't laugh).
I'd go for loading using the COPY command from S3.
Inserts are currently slow, and I mean SLOW. Do not use any method that generates INSERT statements under the hood.
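For illustration, here is a minimal sketch of what that COPY-from-S3 load looks like when issued from Python with psycopg2. The cluster endpoint, table, bucket prefix, and IAM role ARN are all hypothetical placeholders, not anything from your setup:

```python
# Minimal sketch: issue a Redshift COPY against a staged S3 prefix.
# All identifiers (endpoint, table, bucket, role ARN) are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",  # cluster endpoint
    port=5439,
    dbname="analytics",
    user="loader",
    password="REPLACE_ME",
)

copy_sql = """
    COPY events_staging
    FROM 's3://my-staging-bucket/events/2017-01-01/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
    GZIP;
"""

with conn:                      # commit on success, rollback on error
    with conn.cursor() as cur:
        cur.execute(copy_sql)   # one COPY loads every file under the prefix in parallel
conn.close()
```

One COPY per batch like this lets Redshift ingest all files under the prefix in parallel across slices, which is exactly why it beats row-by-row inserts.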
My use cases: Apache Storm streams events to Redshift, using S3 as the staging area. That works fine for hundreds of thousands of events per table per day, dozens of tables per database, several databases per cluster, and a couple of clusters. A second process uses API Gateway, AWS Lambda, and S3 as the staging area; it works just as well for tens of thousands of events per day across a couple of clusters, several databases on each, with one table per database loaded this way.
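To make the second pipeline concrete, here is a hedged sketch of the Lambda half, assuming API Gateway proxy integration posts JSON event batches; the bucket and prefix names are made up for the example:

```python
# Sketch: Lambda handler that lands an incoming event batch on S3 as
# newline-delimited JSON, ready for a later COPY. Names are hypothetical.
import json
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "my-staging-bucket"        # hypothetical staging bucket
PREFIX = "incoming/clickstream/"    # hypothetical prefix

def handler(event, context):
    # With proxy integration, API Gateway passes the POST body as a string.
    records = json.loads(event["body"])

    # One object per invocation keeps files reasonably sized for COPY.
    body = "\n".join(json.dumps(r) for r in records)
    key = PREFIX + str(uuid.uuid4()) + ".json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))

    return {"statusCode": 200, "body": json.dumps({"stored": key})}
```

A scheduled job (or another Lambda) then issues the COPY over the accumulated prefix, rather than loading each tiny object individually.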
You can, in theory, issue the COPY command over SSH, but then you have to prepare manifest files on (wait for it)... S3. So I have no idea why you wouldn't use S3 as the staging data store anyway.
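For completeness, this is roughly what that SSH manifest looks like and how you'd put it on S3 with boto3. Host, username, key, bucket, and paths are placeholders I made up for the sketch:

```python
# Sketch: build and upload a COPY-over-SSH manifest. All values are placeholders.
import json
import boto3

manifest = {
    "entries": [
        {
            "endpoint": "ebs-app-host.example.com",      # on-prem host Redshift SSHes into
            "command": "cat /data/exports/orders.csv",   # command whose stdout gets loaded
            "mandatory": True,
            "username": "redshift",
            "publickey": "ssh-rsa AAAA... (cluster public key)",
        }
    ]
}

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-staging-bucket",                          # hypothetical bucket
    Key="manifests/orders.ssh.manifest",
    Body=json.dumps(manifest).encode("utf-8"),
)

# The COPY then points at the manifest instead of a data prefix, e.g.:
#   COPY orders FROM 's3://my-staging-bucket/manifests/orders.ssh.manifest'
#   IAM_ROLE '...' SSH;
```

So even the "no S3" path still needs an S3 object, which is the point above.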
As for streaming data from on-premises Oracle to S3, that is a different topic entirely, and you should look for answers from someone proficient with Oracle. I'd look at CDC (change data capture), but I'm not an Oracle pro, so I can't say whether that's the right approach.
I hope this helps.