amazon-web-services travis-ci amazon-emr aws-code-deploy

Continuous Integration on AWS EMR

We have a long running EMR cluster that has multiple libraries installed on it using bootstrap actions. Some of these libraries are under continuous development and their codebase is on GitHub.

I've been looking to plug Travis CI with AWS EMR in a similar way to Travis and CodeDeploy. The idea is to get the code on GitHub tested and deployed automatically to EMR while using bootstrap actions to install the updated libraries on all EMR's nodes.

A solution I came up with is to use an EC2 instance in the middle, where Travis and CodeDeploy can be first used to deploy the code on the instance. After that a lunch script on the instance is triggered to create a new EMR cluster with the updated libraries.

However, the above solution means we need to create a new EMR cluster every time we deploy a new version of the system

Any other suggestions?

Solution

You definitely don't want to maintain an EC2 instance to orchestrate a CI/CD process like that. First of all, it introduces a number of challenges because then you need to deal with an entire server instance, keep it maintained, deal with networking, apply monitoring and alerts to deal with availability issues, and even then, you won't have availability guarantees, which may cause other issues. Most of all, maintaining an EC2 instance for a purpose like that simply is unnecessary.

I recommend that you investigate using Amazon CodePipeline with a Lambda Step Function. The Step Function can be used to orchestrate the provisioning of your EMR cluster in a fully serverless environment. With CodePipeline, you can setup a web hook into your Github repo to pull your code and spin up a new deployment automatically whenever changes are committed to your master Github branch (or whatever branch you specify). You can use EMRFS to sync an S3 bucket or folder to your EMR file system for your cluster and then obtain the security benefits of IAM, as well as additional consistency guarantees that come with EMRFS. With Lambda, you also get seamless integration into other services, such as Kinesis, DynamoDB, and CloudWatch, among many others, that will simplify many administrative and development tasks, as well as enable you to have more sophisticated automation with minimal effort.

There are some great resources and tutorials for using CodePipeline with EMR, as well as in general. Here are some examples:

There are also great tutorials for orchestrating applications with Lambda Step Functions, including the use of EMR. Here are some examples:

In the very worst case, if all of those options fail, such as if you need very strict control over the startup process on the EMR cluster after the EMR cluster completes its bootstrapping, you can always create a Java JAR that is loaded as a final step and then use that to either execute a shell script or use the various Amazon Java libraries to run your provisioning commands. In even this case, you still have no need to maintain your own EC2 instance for orchestration purposes (which, in my opinion, still would be hard to justify even if it was running in a Docker container in Kubernetes) because you can easily maintain that deployment process as well with a fully serverless approach.

There are many great videos from the Amazon re:Invent conferences that you may want to watch to get a jump start before you dive into the workshops. For example:

Many more such videos are available on YouTube.

Travis CI also supports Lambda deployment, as mentioned here: https://docs.travis-ci.com/user/deployment/lambda/