amazon-web-services amazon-ec2 amazon-s3 emr

How to execute AWS EMR and Redshift scripts?

I have files in an S3 folder. I need to transform them with Pig scripts on EMR and then write the results back to S3.

After that, the results are loaded from S3 into tables I have created in Redshift.

Currently, I use SQL Workbench to load the files from S3, and I run the Pig script from the AWS console.

How can I call the Pig scripts from a Unix shell? How can I execute the Redshift scripts without SQL Workbench? And how can I run the two sequentially?

Do I need an EC2 Linux instance to connect to EMR? Note: I also have a Windows EC2 instance.

Solution

First you need some EMR launcher code. You can write it with the AWS CLI or the AWS SDK for Java, and use it to launch the EMR job.

You can also create the cluster from the Amazon EMR console. Add a step of type "Pig program" and give the S3 path of your Pig script, along with the input location and the output location in S3. Then launch the job.

Once the job is over, its output will be in S3.
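As a rough sketch of the launcher idea, the Python below wraps the AWS CLI to add a Pig step to a cluster that is already running and then blocks until the step completes. It assumes the `aws` CLI is installed and configured on the machine running it. The cluster ID, bucket paths, step name, and the `INPUT`/`OUTPUT` parameter names are placeholders and must match what your own Pig script expects.

```python
import subprocess


def pig_step_spec(script, input_path, output_path, name="PigTransform"):
    """Build the shorthand value for `--steps` describing one Pig step.

    INPUT and OUTPUT are hypothetical parameter names; use whatever
    $PARAMS your Pig script actually reads.
    """
    args = ["-f", script,
            "-p", f"INPUT={input_path}",
            "-p", f"OUTPUT={output_path}"]
    return (f"Type=PIG,Name={name},ActionOnFailure=CONTINUE,"
            f"Args=[{','.join(args)}]")


def add_step_command(cluster_id, step_spec):
    """Command that submits the step and prints only the new step ID."""
    return ["aws", "emr", "add-steps",
            "--cluster-id", cluster_id,
            "--steps", step_spec,
            "--query", "StepIds[0]", "--output", "text"]


def wait_command(cluster_id, step_id):
    """Command that polls until the step completes; it exits non-zero
    if the step fails or the waiter runs out of attempts."""
    return ["aws", "emr", "wait", "step-complete",
            "--cluster-id", cluster_id, "--step-id", step_id]


def run_pig_step(cluster_id, script, input_path, output_path):
    """Submit the Pig step, wait for it, and return its step ID.

    check=True raises CalledProcessError on a non-zero exit code,
    so a failed step stops whatever calls this function.
    """
    spec = pig_step_spec(script, input_path, output_path)
    result = subprocess.run(add_step_command(cluster_id, spec),
                            check=True, capture_output=True, text=True)
    step_id = result.stdout.strip()
    subprocess.run(wait_command(cluster_id, step_id), check=True)
    return step_id


# Example call (placeholder values, not executed here):
# run_pig_step("j-XXXXXXXX", "s3://mybucket/scripts/transform.pig",
#              "s3://mybucket/input/", "s3://mybucket/output/")
```

The built-in waiter only polls for a limited time, so for long-running Pig jobs you may need to call it in a loop or poll `aws emr describe-step` yourself.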

When the job has finished successfully, launch a script (Python, shell, or Java code) to trigger the COPY command. This script should connect to your Redshift cluster and copy the processed output from S3 into the Redshift table.
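A loader script along these lines could look like the sketch below. It builds a Redshift COPY statement and runs it through `psql`, which can connect to Redshift, instead of SQL Workbench. It assumes `psql` is installed, that the password comes from the `PGPASSWORD` environment variable or `~/.pgpass`, and that the cluster has an IAM role allowed to read the bucket. The host, database, user, table, role ARN, and S3 prefix are all placeholders.

```python
import subprocess


def copy_statement(table, s3_prefix, iam_role, delimiter="\\t"):
    """Build a COPY statement that loads delimited files under an S3 prefix.

    The default delimiter is tab because Pig's PigStorage writes
    tab-separated output unless told otherwise; change it if your
    script stores data with a different separator.
    """
    return (f"COPY {table} FROM '{s3_prefix}' "
            f"IAM_ROLE '{iam_role}' DELIMITER '{delimiter}';")


def psql_command(host, dbname, user, sql, port=5439):
    """Command that runs one SQL statement against the cluster.

    5439 is the default Redshift port; adjust if yours differs.
    """
    return ["psql", "-h", host, "-p", str(port), "-U", user, "-d", dbname,
            "-v", "ON_ERROR_STOP=1", "-c", sql]


def load_to_redshift(host, dbname, user, table, s3_prefix, iam_role):
    """Run the COPY; check=True raises CalledProcessError if psql fails."""
    sql = copy_statement(table, s3_prefix, iam_role)
    subprocess.run(psql_command(host, dbname, user, sql), check=True)


# Example call (placeholder values, not executed here):
# load_to_redshift("mycluster.example.us-east-1.redshift.amazonaws.com",
#                  "dev", "loader", "public.events",
#                  "s3://mybucket/output/",
#                  "arn:aws:iam::123456789012:role/MyRedshiftRole")
```

To run things sequentially, call this only after the EMR step has finished without error. Because both stages raise an exception on a non-zero exit code, the load is skipped automatically when the Pig job fails.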

You can connect to EMR and Redshift from your local machine, or you can use an EC2 instance to run your EMR launcher and Redshift loader scripts.